Hi, I'm trying to run the following command:
source setup.sh && runexp anli-part infobert roberta-base 2e-5 32 128 -1 1000 42 1e-5 5e-3 6 0.1 0 4e-2 8e-2 0 3 5e-3 0.5 0.9
However, the run fails with the error below.
Traceback:
04/08/2022 19:30:17 - INFO - datasets.anli - Saving features into cached file anli_data/cached_dev_RobertaTokenizer_128_anli-part [took 0.690 s]
04/08/2022 19:30:17 - INFO - filelock - Lock 139893720074960 released on anli_data/cached_dev_RobertaTokenizer_128_anli-part.lock
04/08/2022 19:30:17 - INFO - local_robust_trainer - You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
04/08/2022 19:30:17 - INFO - local_robust_trainer - ***** Running training *****
04/08/2022 19:30:17 - INFO - local_robust_trainer - Num examples = 942069
04/08/2022 19:30:17 - INFO - local_robust_trainer - Num Epochs = 3
04/08/2022 19:30:17 - INFO - local_robust_trainer - Instantaneous batch size per device = 32
04/08/2022 19:30:17 - INFO - local_robust_trainer - Total train batch size (w. parallel, distributed & accumulation) = 32
04/08/2022 19:30:17 - INFO - local_robust_trainer - Gradient Accumulation steps = 1
04/08/2022 19:30:17 - INFO - local_robust_trainer - Total optimization steps = 88320
Iteration: 0%| | 0/29440 [00:00<?, ?it/s]
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
File "./run_anli.py", line 395, in <module>
main()
File "./run_anli.py", line 239, in main
model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
File "/root/InfoBERT/ANLI/local_robust_trainer.py", line 731, in train
full_loss, loss_dict = self._adv_training_step(model, inputs, optimizer)
File "/root/InfoBERT/ANLI/local_robust_trainer.py", line 1031, in _adv_training_step
outputs = model(**inputs)
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/InfoBERT/ANLI/models/roberta.py", line 345, in forward
inputs_embeds=inputs_embeds,
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/InfoBERT/ANLI/models/bert.py", line 822, in forward
output_hidden_states=output_hidden_states,
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/InfoBERT/ANLI/models/bert.py", line 494, in forward
output_attentions,
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/InfoBERT/ANLI/models/bert.py", line 416, in forward
hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/InfoBERT/ANLI/models/bert.py", line 347, in forward
hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/InfoBERT/ANLI/models/bert.py", line 239, in forward
mixed_query_layer = self.query(hidden_states)
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/functional.py", line 1372, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Traceback (most recent call last):
File "/root/miniconda3/envs/infobert/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/root/miniconda3/envs/infobert/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/infobert/bin/python', '-u', './run_anli.py', '--local_rank=0', '--model_name_or_path', 'roberta-base', '--task_name', 'anli-part', '--do_train', '--do_eval', '--data_dir', 'anli_data', '--max_seq_length', '128', '--per_device_train_batch_size', '32', '--learning_rate', '2e-5', '--max_steps', '-1', '--warmup_steps', '1000', '--weight_decay', '1e-5', '--seed', '42', '--beta', '5e-3', '--logging_dir', 'infobert-roberta-base-anli-part-sl128-lr2e-5-bs32-ts-1-ws1000-wd1e-5-seed42-beta5e-3-alpha5e-3--cl0.5-ch0.9-alr4e-2-amag8e-2-anm0-as3-hdp0.1-adp0-version6', '--output_dir', 'infobert-roberta-base-anli-part-sl128-lr2e-5-bs32-ts-1-ws1000-wd1e-5-seed42-beta5e-3-alpha5e-3--cl0.5-ch0.9-alr4e-2-amag8e-2-anm0-as3-hdp0.1-adp0-version6', '--version', '6', '--evaluate_during_training', '--logging_steps', '500', '--save_steps', '500', '--hidden_dropout_prob', '0.1', '--attention_probs_dropout_prob', '0', '--overwrite_output_dir', '--adv_lr', '4e-2', '--adv_init_mag', '8e-2', '--adv_max_norm', '0', '--adv_steps', '3', '--alpha', '5e-3', '--cl', '0.5', '--ch', '0.9']' returned non-zero exit status 1.
Do you know how to fix this CUBLAS_STATUS_EXECUTION_FAILED error?
Thank you so much.
Other information:
- OS: Ubuntu 20.04.3 LTS
- GPU: NVIDIA A100
- Python: 3.7.13
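In case it helps with diagnosis, here is a small standalone snippet (not part of the InfoBERT repo) that reports which CUDA version the installed PyTorch was built against and the GPU's compute capability. This kind of cuBLAS failure is sometimes a build/hardware mismatch (e.g. an older CUDA build running on an A100, which is sm_80), so these values may be relevant; the script degrades gracefully if PyTorch or a GPU is unavailable.

```python
def cuda_env_report():
    """Collect PyTorch/CUDA build info as a dict; degrades gracefully
    when torch is not installed or no GPU is visible."""
    info = {}
    try:
        import torch
        info["torch"] = torch.__version__
        # None for CPU-only builds; e.g. "10.1" or "11.3" otherwise
        info["built_with_cuda"] = torch.version.cuda
        if torch.cuda.is_available():
            info["device"] = torch.cuda.get_device_name(0)
            # (8, 0) corresponds to sm_80, i.e. an A100
            info["capability"] = torch.cuda.get_device_capability(0)
    except ImportError:
        info["torch"] = None
    return info

if __name__ == "__main__":
    print(cuda_env_report())
```

Running this inside the `infobert` conda environment would show whether the installed torch build matches the A100.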