Hi, after following the instructions here to run the abstractive text summarization code, I am running into the following issue:
2021-08-02 18:15:48 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:15:48 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:15:48 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:15:48 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:15:53 | INFO | fairseq.utils | rank 0: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB
2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:15:53 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:15:53 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:15:53 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x5612fbcaa000 @ 0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
tcmalloc: large alloc 1625169920 bytes == 0x56135ca8c000 @ 0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
2021-08-02 18:16:00 | INFO | fairseq.trainer | NOTE: your device does NOT support faster training with --fp16, please switch to FP32 which is likely to be faster
2021-08-02 18:16:00 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:16:00 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:16:00 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:16:00 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:16:01 | INFO | fairseq.trainer | begin training epoch 1
2021-08-02 18:16:11 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
2021-08-02 18:16:20 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
2021-08-02 18:16:29 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
2021-08-02 18:16:38 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2021-08-02 18:16:48 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2021-08-02 18:16:57 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
2021-08-02 18:17:06 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
2021-08-02 18:17:15 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
2021-08-02 18:17:30 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
Traceback (most recent call last):
File "/usr/local/bin/fairseq-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
distributed_utils.call_main(cfg, main)
File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
main(cfg, **kwargs)
File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/usr/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
log_output = trainer.train_step(samples)
File "/usr/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
raise e
File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
ignore_grad=is_dummy_batch,
File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
optimizer.backward(loss)
File "/content/R-Drop/fairseq_src/fairseq/optim/fp16_optimizer.py", line 101, in backward
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
I am using CUDA 11.4 (I tried 11.0 before), PyTorch 1.8.1, and Python 3.7. I have preprocessed the CNN/Daily Mail data as instructed, I am using bart.large, and script/run_train.sh is in its default configuration.
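For what it's worth, the trainer's note that the device does not support faster --fp16 training matches the hardware: the P100 reports compute capability 6.0, i.e. pre-Volta with no tensor cores. A quick standalone check (plain PyTorch, nothing from the R-Drop repo) shows what the environment actually sees:

```python
import torch

# What this environment reports, independent of fairseq / R-Drop.
print(torch.__version__, torch.version.cuda)   # torch.version.cuda is the CUDA PyTorch was BUILT with
print(torch.cuda.get_device_name(0))           # Tesla P100-PCIE-16GB
print(torch.cuda.get_device_capability(0))     # (6, 0) -> pre-Volta, no tensor cores
```

Note that torch.version.cuda is the CUDA version the PyTorch 1.8.1 wheel was built against, which may differ from the system-wide CUDA 11.4; if they disagree, that mismatch might be related to the cuBLAS failures.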
If I run without the --fp16 option, training instead fails as follows:
2021-08-02 18:27:32 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:27:32 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:27:32 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:27:32 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:27:37 | INFO | fairseq.utils | rank 0: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB
2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:27:37 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:27:37 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:27:37 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x5610c8c0c000 @ 0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
tcmalloc: large alloc 1625169920 bytes == 0x56112a1ee000 @ 0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
2021-08-02 18:27:42 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:27:42 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:27:43 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:27:43 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:27:44 | INFO | fairseq.trainer | begin training epoch 1
/content/R-Drop/fairseq_src/fairseq/utils.py:345: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2021-08-02 18:28:42 | INFO | train_inner | epoch 001: 100 / 253944 loss=14.455, nll_loss=9.638, ppl=796.58, wps=117.7, ups=1.72, wpb=68.4, bsz=1.1, num_updates=100, lr=6e-06, gnorm=232.648, clip=100, train_wall=58, gb_free=4.4, wall=66
2021-08-02 18:29:37 | INFO | train_inner | epoch 001: 200 / 253944 loss=10.224, nll_loss=6.292, ppl=78.34, wps=125.7, ups=1.81, wpb=69.4, bsz=1.1, num_updates=200, lr=1.2e-05, gnorm=34.896, clip=100, train_wall=55, gb_free=6.8, wall=121
Traceback (most recent call last):
File "/usr/local/bin/fairseq-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
distributed_utils.call_main(cfg, main)
File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
main(cfg, **kwargs)
File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/usr/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
log_output = trainer.train_step(samples)
File "/usr/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
raise e
File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
ignore_grad=is_dummy_batch,
File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
optimizer.backward(loss)
File "/content/R-Drop/fairseq_src/fairseq/optim/fairseq_optimizer.py", line 99, in backward
loss.backward()
File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
I have also tried the bart.base model, thinking the failure could be due to memory requirements since my GPU has only 16 GB, but then I run into the dictionary-size issues described here.
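For reference, the mismatch can be seen by comparing the dictionary shipped with each checkpoint against the one the data was binarized with; the paths below are only my assumptions based on the layout above (bart.base untarred next to bart.large), not anything from the R-Drop scripts:

```python
from fairseq.data import Dictionary

# Paths are assumptions based on the directory layout above; adjust as needed.
for name, path in [
    ("bart.large", "/content/bart.large/dict.txt"),
    ("bart.base", "/content/bart.base/dict.txt"),
    ("binarized source data", "/content/cnn-dailymail/cnn_dm-bin/dict.source.txt"),
]:
    print(name, len(Dictionary.load(path)))
```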
Any advice on the above?
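If it helps with narrowing this down, here is a standalone sketch (plain PyTorch, no fairseq; the shapes are arbitrary) that exercises the same kind of strided-batched GEMM the two RuntimeErrors point at, in both fp16 and fp32, to check whether the failure is environmental or specific to the R-Drop code path:

```python
import torch

def try_bmm(dtype):
    # Strided batched GEMM, the same family of cuBLAS call as in the tracebacks above.
    a = torch.randn(16, 128, 64, device="cuda", dtype=dtype, requires_grad=True)
    b = torch.randn(16, 64, 128, device="cuda", dtype=dtype)
    torch.bmm(a, b).sum().backward()   # the failures above happen inside backward()
    torch.cuda.synchronize()           # surface any asynchronous CUDA error here
    print(dtype, "OK")

try_bmm(torch.half)    # mirrors the --fp16 run (cublasGemmStridedBatchedExFix)
try_bmm(torch.float)   # mirrors the fp32 run (cublasSgemmStridedBatched)
```

Since CUDA errors are raised asynchronously, re-running the real training command with CUDA_LAUNCH_BLOCKING=1 should also make the reported stack trace point at the op that actually fails.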