R-Drop: Regularized Dropout for Neural Networks

Related tags: Deep Learning, R-Drop

Comments
  • R-drop makes my model broken.

    In my NMT task, I let the encoder and decoder forward twice, but the resulting kl_loss is too large. When I compute the mean instead, it is too small to have any effect.

    Can someone help me?
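
    Not an official answer, just a minimal sketch (assuming a generic PyTorch seq2seq setup; the function name, the masking scheme, and alpha=5.0 are illustrative choices, not the repository's code) of one common way to keep the two terms on a comparable scale: sum the cross-entropy of both passes and the symmetric KL over the non-pad tokens, then divide by the same token count and weight the KL with a coefficient alpha.

        import torch
        import torch.nn.functional as F

        def rdrop_nmt_loss(logits1, logits2, target, pad_idx, alpha=5.0):
            """Illustrative R-Drop loss for seq2seq: cross-entropy on both
            passes plus a symmetric KL term, normalized by non-pad tokens."""
            vocab = logits1.size(-1)
            ce = (F.cross_entropy(logits1.view(-1, vocab), target.view(-1),
                                  ignore_index=pad_idx, reduction="sum")
                  + F.cross_entropy(logits2.view(-1, vocab), target.view(-1),
                                    ignore_index=pad_idx, reduction="sum"))
            p_log = F.log_softmax(logits1, dim=-1)
            q_log = F.log_softmax(logits2, dim=-1)
            # per-position symmetric KL, shape (batch, seq_len)
            kl = F.kl_div(p_log, q_log.exp(), reduction="none").sum(-1)
            kl = kl + F.kl_div(q_log, p_log.exp(), reduction="none").sum(-1)
            mask = target.ne(pad_idx)
            kl = kl.masked_select(mask).sum() / 2
            n_tokens = mask.sum().clamp(min=1)
            return (ce + alpha * kl) / n_tokens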

    opened by MayDomine 9
  • Inconsistency in KL loss and CE loss hyper-parameters and baseline results on GLUE

    There is an inconsistency between the bert_modeling and roberta_modeling files: the BERT loss is ce(logits1, labels) + ce(logits2, labels) + 0.5/2.0 * (kl(logits1, logits2) + kl(logits2, logits1)), i.e. the paper's alpha is 0.5 here, while the RoBERTa loss is 0.5 * (ce(logits1, labels) + ce(logits2, labels)) + 0.7/2.0 * (kl(logits1, logits2) + kl(logits2, logits1)), i.e. alpha is 0.7 and the CE loss is also averaged. What are the tricks here?

    In BERT:

        alpha = 1.0
        for logits in logits_list:
            if labels is not None:
                if self.num_labels == 1:
                    # We are doing regression
                    loss_fct = MSELoss()
                    if loss:
                        loss += alpha * loss_fct(logits.view(-1), labels.view(-1))
                    else:
                        loss = alpha * loss_fct(logits.view(-1), labels.view(-1))
                else:
                    loss_fct = CrossEntropyLoss()
                    if loss:
                        loss += alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
                    else:
                        loss = alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if loss is not None:
            if self.num_labels == 1:
                loss_fct = MSELoss()
                loss += 1.0 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
            else:
                p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
                p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
                q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
                q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

                kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none').sum()
                reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none').sum()
                loss += 0.5 * (kl_loss + reverse_kl_loss) / 2.
    

    In RoBERTa:

        loss = None
        if labels is not None:
            if self.num_labels == 1:
                # We are doing regression
                loss_fct = MSELoss()
                if loss is None:
                    loss = 0.5 * loss_fct(logits_list[0].view(-1), labels.view(-1))
                else:
                    loss += 0.5 * loss_fct(logits_list[-1].view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                if loss is None:
                    loss = 0.5 * loss_fct(logits_list[0].view(-1, self.num_labels), labels.view(-1))
                else:
                    loss += 0.5 * loss_fct(logits_list[-1].view(-1, self.num_labels), labels.view(-1))

        if loss is not None:
            if self.num_labels == 1:
                loss_fct = MSELoss()
                loss += 0.8 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
            else:
                p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
                p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
                q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
                q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

                kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none')
                reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none')

                loss += 0.7 * (kl_loss.sum() + reverse_kl_loss.sum()) / 2
    
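    For comparison, here is a hypothetical unified form of the two variants above (the function name and signature are mine, not the repository's) that makes the alpha weight and the optional CE averaging explicit. Note that the posted snippets sum the KL over the batch rather than averaging it, so their effective scale also depends on batch size; the sketch uses batchmean instead.

        import torch
        import torch.nn.functional as F

        def rdrop_classification_loss(logits1, logits2, labels, alpha, average_ce=False):
            """Sketch of a GLUE-style R-Drop loss: CE on both forward passes
            (optionally averaged) plus alpha times the symmetric KL between them."""
            ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
            if average_ce:  # the RoBERTa variant above: 0.5 * (ce1 + ce2)
                ce = 0.5 * ce
            p = F.log_softmax(logits1, dim=-1)
            q = F.log_softmax(logits2, dim=-1)
            sym_kl = 0.5 * (F.kl_div(p, q.exp(), reduction="batchmean")
                            + F.kl_div(q, p.exp(), reduction="batchmean"))
            return ce + alpha * sym_kl

        # BERT-style as posted: alpha=0.5, average_ce=False
        # RoBERTa-style as posted: alpha=0.7, average_ce=True
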
    opened by zhangzhenyu13 5
  • Cannot reproduce the results for fine-tuning ViT on CIFAR-100 with the hyperparameters in the paper

    I ran the provided code with the hyperparameters lr = 1e-2, alpha = 0.3, dropout = 0.1, resolution = 384x384, 10,000 global steps, and batch size = 512, yet the result I got is far from the improvement reported in the paper.

    opened by NamlessM 4
  • Training configuration for the WMT14 EnDe dataset?

    Hi, I was trying to reproduce the results on the WMT14 EnDe dataset, but I was unable to get the BLEU improvement shown in the paper. Could you share the training script for that? Thanks!

    opened by frankang 4
  • How do the `warmup steps` affect performance?

    Hi, thanks for your insightful work. In https://github.com/dropreg/R-Drop/blob/main/huggingface_transformer_src/README.md, the warmup-steps hyperparameter looks unusual. How should it be chosen, and how does it affect performance?
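
    For context, the warmup steps only control how quickly the learning rate ramps up at the start of training. A rough sketch of the usual linear warmup-then-decay multiplier (a generic convention for illustration, not necessarily the exact schedule the repo uses):

        def linear_warmup_decay(step, warmup_steps, total_steps):
            """LR multiplier: ramp linearly from 0 to 1 over warmup_steps,
            then decay linearly back to 0 at total_steps."""
            if step < warmup_steps:
                return step / max(1, warmup_steps)
            return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

        # e.g. warmup_steps=1000, total_steps=10000: the LR peaks at step 1000;
        # a larger value gives a longer, more conservative ramp-up.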

    opened by Doragd 2
  • unable to reproduce results on GLUE

    Hi, I am trying to reproduce the results on GLUE, but they come out clearly lower than in the paper. I ran the code with the suggested hyperparameters on a 32 GB V100 (CUDA 10.2, Ubuntu, Python 3.6, PyTorch 1.8).

    ==> run_task_baseline_CoLA.log <==
    [INFO|trainer.py:1963] 2021-10-26 01:37:38,527 >>   Num examples = 1043
    [INFO|trainer.py:1966] 2021-10-26 01:37:38,527 >>   Batch size = 8
    100%|##########| 131/131 [00:05<00:00, 23.67it/s]
    [INFO|trainer_pt_utils.py:898] 2021-10-26 01:37:44,107 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   epoch                     =       9.97
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_loss                 =     1.7947
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_matthews_correlation =     0.6032
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_runtime              = 0:00:05.57
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_samples              =       1043
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_samples_per_second   =    186.945
    
    ==> run_task_baseline_MNLI.log <==
    [INFO|trainer.py:1963] 2021-10-26 08:31:45,749 >>   Num examples = 9832
    [INFO|trainer.py:1966] 2021-10-26 08:31:45,749 >>   Batch size = 8
    100%|##########| 1229/1229 [00:54<00:00, 22.53it/s]
    [INFO|trainer_pt_utils.py:898] 2021-10-26 08:32:40,333 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   epoch                   =      10.09
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_accuracy           =      0.853
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_loss               =     0.8149
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_runtime            = 0:00:54.58
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_samples            =       9832
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_samples_per_second =    180.129
    
    ==> run_task_baseline_MRPC.log <==
    100%|##########| 51/51 [00:02<00:00, 23.40it/s]
    [INFO|trainer_pt_utils.py:898] 2021-10-26 01:30:21,104 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   epoch                   =       9.98
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_accuracy           =      0.848
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_combined_score     =     0.8708
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_f1                 =     0.8935
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_loss               =     1.4835
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_runtime            = 0:00:02.22
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_samples            =        408
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_samples_per_second =    183.422
    
    ==> run_task_baseline_QNLI.log <==
    [INFO|trainer.py:1963] 2021-10-26 03:19:05,420 >>   Num examples = 5463
    [INFO|trainer.py:1966] 2021-10-26 03:19:05,420 >>   Batch size = 8
    100%|##########| 683/683 [00:29<00:00, 22.80it/s]
    [INFO|trainer_pt_utils.py:898] 2021-10-26 03:19:35,416 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   epoch                   =      10.11
    [INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_accuracy           =     0.9143
    [INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_loss               =     0.5311
    [INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_runtime            = 0:00:29.99
    [INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_samples            =       5463
    [INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_samples_per_second =    182.125
    
    ==> run_task_baseline_QQP.log <==
    100%|##########| 5054/5054 [03:42<00:00, 22.72it/s]
    [INFO|trainer_pt_utils.py:898] 2021-10-26 08:26:18,073 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   epoch                   =       9.96
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   eval_accuracy           =      0.912
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   eval_combined_score     =     0.8972
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_f1                 =     0.8824
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_loss               =      0.533
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_runtime            = 0:03:42.46
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_samples            =      40430
    [INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_samples_per_second =     181.74
    
    ==> run_task_baseline_RTE.log <==
    [INFO|trainer.py:1963] 2021-10-26 01:28:46,682 >>   Num examples = 277
    [INFO|trainer.py:1966] 2021-10-26 01:28:46,682 >>   Batch size = 8
    100%|##########| 35/35 [00:01<00:00, 24.70it/s]
    [INFO|trainer_pt_utils.py:898] 2021-10-26 01:28:48,142 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   epoch                   =       6.53
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_accuracy           =     0.6462
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_loss               =     1.9563
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_runtime            = 0:00:01.46
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_samples            =        277
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_samples_per_second =    189.656
    
    ==> run_task_baseline_SST2.log <==
    [INFO|trainer.py:1963] 2021-10-26 02:33:21,244 >>   Num examples = 872
    [INFO|trainer.py:1966] 2021-10-26 02:33:21,244 >>   Batch size = 8
    100%|##########| 109/109 [00:04<00:00, 23.88it/s]
    [INFO|trainer_pt_utils.py:898] 2021-10-26 02:33:25,854 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   epoch                   =       9.95
    [INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_accuracy           =     0.9255
    [INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_loss               =     0.6675
    [INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_runtime            = 0:00:04.60
    [INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_samples            =        872
    [INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_samples_per_second =    189.197
    
    ==> run_task_baseline_STSB.log <==
    100%|##########| 188/188 [00:08<00:00, 22.73it/s]
    [INFO|trainer_pt_utils.py:898] 2021-10-26 01:34:24,178 >> ***** eval metrics *****
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   epoch                   =       9.99
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_combined_score     =     0.8904
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_loss               =     0.9493
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_pearson            =     0.8921
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_runtime            = 0:00:08.31
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_samples            =       1500
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_samples_per_second =    180.347
    [INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_spearmanr          =     0.8887
    
    opened by 1024er 2
  • Unable to preprocess data for summarization

    I followed these instructions:

    git clone https://github.com/dropreg/R-Drop.git
    cd R-Drop/fairseq_src/
    pip install --editable .
    

    and tried to preprocess the data for summarization by running,

    bash script/preprocess.sh
    

    However, I get the following error:

    /users/gpu/samiks/anaconda3/envs/rdrop/bin/python: No module named examples.roberta.multiprocessing_bpe_encoder
    

    It seems multiprocessing_bpe_encoder is missing from this repo. Are we supposed to run the preprocessing with a separate fairseq install?

    opened by samiksome92 2
  • CUDA error: CUBLAS_STATUS_EXECUTION_FAILED

    Hi, after following the instructions here to make the code run for abstractive text summarization, I am running into the following issue:

    2021-08-02 18:15:48 | INFO | fairseq_cli.train | task: RDropTranslationTask
    2021-08-02 18:15:48 | INFO | fairseq_cli.train | model: BARTModel
    2021-08-02 18:15:48 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
    2021-08-02 18:15:48 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
    2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
    2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
    2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
    2021-08-02 18:15:53 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
    2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
    2021-08-02 18:15:53 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
    2021-08-02 18:15:53 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
    2021-08-02 18:15:53 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
    tcmalloc: large alloc 1625169920 bytes == 0x5612fbcaa000 @  0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
    tcmalloc: large alloc 1625169920 bytes == 0x56135ca8c000 @  0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
    2021-08-02 18:16:00 | INFO | fairseq.trainer | NOTE: your device does NOT support faster training with --fp16, please switch to FP32 which is likely to be faster
    2021-08-02 18:16:00 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
    2021-08-02 18:16:00 | INFO | fairseq.trainer | loading train data for epoch 1
    2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
    2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
    2021-08-02 18:16:00 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
    2021-08-02 18:16:00 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
    2021-08-02 18:16:01 | INFO | fairseq.trainer | begin training epoch 1
    2021-08-02 18:16:11 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
    2021-08-02 18:16:20 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
    2021-08-02 18:16:29 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
    2021-08-02 18:16:38 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
    2021-08-02 18:16:48 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
    2021-08-02 18:16:57 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
    2021-08-02 18:17:06 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
    2021-08-02 18:17:15 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
    2021-08-02 18:17:30 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
    Traceback (most recent call last):
      File "/usr/local/bin/fairseq-train", line 33, in <module>
        sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
      File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
        main(cfg, **kwargs)
      File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
        valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
      File "/usr/lib/python3.7/contextlib.py", line 74, in inner
        return func(*args, **kwds)
      File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
        log_output = trainer.train_step(samples)
      File "/usr/lib/python3.7/contextlib.py", line 74, in inner
        return func(*args, **kwds)
      File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
        raise e
      File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
        ignore_grad=is_dummy_batch,
      File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
        loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
      File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
        optimizer.backward(loss)
      File "/content/R-Drop/fairseq_src/fairseq/optim/fp16_optimizer.py", line 101, in backward
        loss.backward()
      File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
        allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
    
    

    I am using CUDA 11.4 (I tried 11.0 before), PyTorch 1.8.1, and Python 3.7. I have preprocessed the CNN/Daily Mail data as instructed, I am using bart.large, and script/run_train.sh is in its default configuration.

    If I run without the --fp16 option, my code fails instead in the following way

    2021-08-02 18:27:32 | INFO | fairseq_cli.train | task: RDropTranslationTask
    2021-08-02 18:27:32 | INFO | fairseq_cli.train | model: BARTModel
    2021-08-02 18:27:32 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
    2021-08-02 18:27:32 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
    2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
    2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
    2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
    2021-08-02 18:27:37 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
    2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
    2021-08-02 18:27:37 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
    2021-08-02 18:27:37 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
    2021-08-02 18:27:37 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
    tcmalloc: large alloc 1625169920 bytes == 0x5610c8c0c000 @  0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
    tcmalloc: large alloc 1625169920 bytes == 0x56112a1ee000 @  0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
    2021-08-02 18:27:42 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
    2021-08-02 18:27:42 | INFO | fairseq.trainer | loading train data for epoch 1
    2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
    2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
    2021-08-02 18:27:43 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
    2021-08-02 18:27:43 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
    2021-08-02 18:27:44 | INFO | fairseq.trainer | begin training epoch 1
    /content/R-Drop/fairseq_src/fairseq/utils.py:345: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
      "amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
    2021-08-02 18:28:42 | INFO | train_inner | epoch 001:    100 / 253944 loss=14.455, nll_loss=9.638, ppl=796.58, wps=117.7, ups=1.72, wpb=68.4, bsz=1.1, num_updates=100, lr=6e-06, gnorm=232.648, clip=100, train_wall=58, gb_free=4.4, wall=66
    2021-08-02 18:29:37 | INFO | train_inner | epoch 001:    200 / 253944 loss=10.224, nll_loss=6.292, ppl=78.34, wps=125.7, ups=1.81, wpb=69.4, bsz=1.1, num_updates=200, lr=1.2e-05, gnorm=34.896, clip=100, train_wall=55, gb_free=6.8, wall=121
    Traceback (most recent call last):
      File "/usr/local/bin/fairseq-train", line 33, in <module>
        sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
      File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
        distributed_utils.call_main(cfg, main)
      File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
        main(cfg, **kwargs)
      File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
        valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
      File "/usr/lib/python3.7/contextlib.py", line 74, in inner
        return func(*args, **kwds)
      File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
        log_output = trainer.train_step(samples)
      File "/usr/lib/python3.7/contextlib.py", line 74, in inner
        return func(*args, **kwds)
      File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
        raise e
      File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
        ignore_grad=is_dummy_batch,
      File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
        loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
      File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
        optimizer.backward(loss)
      File "/content/R-Drop/fairseq_src/fairseq/optim/fairseq_optimizer.py", line 99, in backward
        loss.backward()
      File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
        allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
    
    

    I have tried to use the bart.base model, thinking it could be due to the size requirements and that my GPU only has 16GB of memory, but I run into dictionary size issues as described here.

    Any advice on the above?

    opened by paul-chelarescu 2
  • Summarization task fails with 'Trying to backward through the graph a second time'

    Hi, following the instructions in the readme verbatim, the summarization task defined here fails with the error below unless the following four lines are removed from here:

    RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

            if ignore_grad:
                loss *= 0
            with torch.autograd.profiler.record_function("backward"):
                optimizer.backward(loss)             
    

    It seems like these lines are duplicated in this part of the code, causing the error.
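
    For anyone hitting the same thing, a minimal standalone reproduction (not the repo's code) shows why the duplicated lines matter: backward can only run once per computed loss unless retain_graph=True is passed.

        import torch

        x = torch.randn(4, 8, requires_grad=True)
        loss = (x * 2).sum()
        loss.backward()    # the first backward frees the intermediate graph
        # loss.backward()  # a second call raises "Trying to backward through the graph a second time"

    So the fix appears to be ensuring optimizer.backward(loss) runs only once per train_step; judging from the tracebacks in the CUDA error issue above, the criterion's forward_reg already calls it.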

    opened by paul-chelarescu 2
  • Question about the proof

    In Appendix B, how does Equation (9) transfer to Equation (10)?

    I think the factor should be $2(1-p)$, not $(1-p)$: $\|w^T x_i - \frac{1}{p}(w^T x_i) * \zeta\| = \|w^T x_i * [1, \dots, 1]^T - \frac{1}{p}(w^T x_i) * \zeta\| = \|w^T x_i\| * \|C\|$, where $\|\cdot\|$ is the 1-norm. A fraction $p$ of the entries of $C$ equal $1 - \frac{1}{p}$ and the remaining fraction $1-p$ equal $1$, hence $\|C\| = p * |1 - \frac{1}{p}| + (1-p) = 2(1-p)$.

    opened by SYSUykLin 1
  • Is the kl loss in the ViT example supposed to be divided by 2?

    https://github.com/dropreg/R-Drop/blob/3d97565595747f3b3d9c4701cb2fb824a9139913/vit_src/models/modeling.py#L298

    Isn't L298 supposed to be the following?

    loss += self.alpha * (kl_loss + reverse_kl_loss) / 2
    
    opened by krenerd 1
  • Some questions about reproducing GLUE

    Sorry to bother you. I'm very interested in your work R-Drop, but I ran into some problems when reproducing the GLUE experiments with bert-base-uncased. I used PyTorch 1.8, Python 3.6.13, and pip install --editable ., and set the hyperparameters per dataset as described in the README, but the results on CoLA, RTE, and MRPC are only 58.1, 66.4, and 82.8, which are far from the 62.6, 71.1, and 87.3 reported in the paper.

    opened by wpwpwpyo 0
Pytorch implementation of Learning Rate Dropout.

Learning-Rate-Dropout Pytorch implementation of Learning Rate Dropout. Paper Link: https://arxiv.org/pdf/1912.00144.pdf Train ResNet-34 for Cifar10: r

null 42 Nov 25, 2022
Unofficial PyTorch implementation of Guided Dropout

Unofficial PyTorch implementation of Guided Dropout This is a simple implementation of Guided Dropout for research. We try to reproduce the algorithm

null 2 Jan 7, 2022
Home repository for the Regularized Greedy Forest (RGF) library. It includes original implementation from the paper and multithreaded one written in C++, along with various language-specific wrappers.

Regularized Greedy Forest Regularized Greedy Forest (RGF) is a tree ensemble machine learning method described in this paper. RGF can deliver better r

RGF-team 364 Dec 28, 2022
Two-Stage Peer-Regularized Feature Recombination for Arbitrary Image Style Transfer

Two-Stage Peer-Regularized Feature Recombination for Arbitrary Image Style Transfer Paper on arXiv Public PyTorch implementation of two-stage peer-reg

NNAISENSE 38 Oct 14, 2022
(IEEE TIP 2021) Regularized Densely-connected Pyramid Network for Salient Instance Segmentation

RDPNet IEEE TIP 2021: Regularized Densely-connected Pyramid Network for Salient Instance Segmentation PyTorch training and testing code are available.

Yu-Huan Wu 41 Oct 21, 2022
Disagreement-Regularized Imitation Learning

Due to a normalization bug the expert trajectories have lower performance than the rl_baseline_zoo reported experts. Please see the following link in

Kianté Brantley 25 Apr 28, 2022
Code for the paper "Adversarially Regularized Autoencoders (ICML 2018)" by Zhao, Kim, Zhang, Rush and LeCun

ARAE Code for the paper "Adversarially Regularized Autoencoders (ICML 2018)" by Zhao, Kim, Zhang, Rush and LeCun https://arxiv.org/abs/1706.04223 Disc

Junbo (Jake) Zhao 399 Jan 2, 2023
This repository is the official implementation of Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning (NeurIPS21).

Core-tuning This repository is the official implementation of ``Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regular

vanint 18 Dec 17, 2022
Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

Graph Regularized Residual Subspace Clustering Network for hyperspectral image clustering

Yaoming Cai 5 Jul 18, 2022
Flexible-CLmser: Regularized Feedback Connections for Biomedical Image Segmentation

Flexible-CLmser: Regularized Feedback Connections for Biomedical Image Segmentation The skip connections in U-Net pass features from the levels of enc

Boheng Cao 1 Dec 29, 2021
Code for the paper: On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations

Non-Parametric Prior Actor-Critic (N-PPAC) This repository contains the code for On Pathologies in KL-Regularized Reinforcement Learning from Expert D

Cong Lu 5 May 13, 2022
Python implementation of cover trees, near-drop-in replacement for scipy.spatial.kdtree

This is a Python implementation of cover trees, a data structure for finding nearest neighbors in a general metric space (e.g., a 3D box with periodic

Patrick Varilly 28 Nov 25, 2022
Complex-Valued Neural Networks (CVNN)

Complex-Valued Neural Networks (CVNN) Done by @NEGU93 - J. Agustin Barrachina Using this library, the only difference with a Tensorflow code is that y

youceF 1 Nov 12, 2021
This repository contains notebook implementations of the following Neural Process variants: Conditional Neural Processes (CNPs), Neural Processes (NPs), Attentive Neural Processes (ANPs).

The Neural Process Family This repository contains notebook implementations of the following Neural Process variants: Conditional Neural Processes (CN

DeepMind 892 Dec 28, 2022
A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks

A framework that constructs deep neural networks, autoencoders, logistic regressors, and linear networks without the use of any outside machine learning libraries - all from scratch.

Kordel K. France 2 Nov 14, 2022
Code for "Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks", CVPR 2021

Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks This repository contains the code that accompanies our CVPR 20

Despoina Paschalidou 161 Dec 20, 2022
Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks

Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks. Bayesian-Torch is designed to be flexible and seamless in extending a deterministic deep neural network architecture to corresponding Bayesian form by simply replacing the deterministic layers with Bayesian layers.

Intel Labs 210 Jan 4, 2023
An implementation demo of the ICLR 2021 paper Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks in PyTorch.

Neural Attention Distillation This is an implementation demo of the ICLR 2021 paper Neural Attention Distillation: Erasing Backdoor Triggers from Deep

Yige-Li 84 Jan 4, 2023
DeepHyper: Scalable Asynchronous Neural Architecture and Hyperparameter Search for Deep Neural Networks

What is DeepHyper? DeepHyper is a software package that uses learning, optimization, and parallel computing to automate the design and development of

DeepHyper Team 214 Jan 8, 2023