DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Overview

This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Disentangled Attention and DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing.

News

12/8/2021

  • DeBERTa-V3-XSmall is added. With only 22M backbone parameters, about 1/4 of RoBERTa-Base and XLNet-Base, DeBERTa-V3-XSmall significantly outperforms the latter two on MNLI and SQuAD v2.0 (i.e. by 1.2% on MNLI-m and 1.5% EM on SQuAD v2.0). This further demonstrates the efficiency of DeBERTaV3 models.

11/16/2021

3/31/2021

  • Masked language model task is added
  • SuperGLUE tasks are added
  • SiFT code is added

2/03/2021

DeBERTa v2 code and the 900M and 1.5B models are now available. This includes the 1.5B model used for our SuperGLUE single-model submission, which achieved 89.9 versus the human baseline of 89.8. You can find more details about this submission in our blog.

What's new in v2

  • Vocabulary: In v2 we use a new vocabulary of size 128K built from the training data. Instead of the GPT2 tokenizer, we use a sentencepiece tokenizer.
  • nGiE (nGram Induced Input Encoding): In v2 we use an additional convolution layer alongside the first transformer layer to better learn the local dependency of input tokens. We will add more ablation studies on this feature.
  • Sharing the position projection matrix with the content projection matrix in the attention layer: Based on our previous experiments, we found this can save parameters without affecting performance.
  • Apply buckets to encode relative positions: In v2 we use log buckets to encode relative positions, similar to T5.
  • 900M model & 1.5B model: In v2 we scale our model size to 900M and 1.5B, which significantly improves the performance on downstream tasks.
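The log-bucket idea in the list above can be illustrated with a small sketch. This is not the repository's code: the function name `log_bucket` and the default `bucket_size` and `max_position` values are assumptions chosen for illustration. Nearby distances keep their exact value, while larger distances share logarithmically spaced buckets.

```python
import math


def log_bucket(relative_pos: int, bucket_size: int = 256, max_position: int = 512) -> int:
    """Map a signed relative distance to a signed bucket index.

    Hypothetical helper: distances up to bucket_size // 2 are kept exactly,
    larger distances are assigned to logarithmically spaced buckets.
    """
    mid = bucket_size // 2
    abs_pos = abs(relative_pos)
    if abs_pos <= mid:
        return relative_pos  # exact index for nearby tokens
    # Position of abs_pos on a log scale between `mid` and `max_position - 1`.
    scale = math.log(abs_pos / mid) / math.log((max_position - 1) / mid)
    bucket = math.ceil(scale * (mid - 1)) + mid
    bucket = min(bucket, bucket_size - 1)  # clamp distances beyond max_position
    return bucket if relative_pos > 0 else -bucket


if __name__ == "__main__":
    for distance in (0, 5, -5, 128, 200, 511, 1000):
        print(distance, "->", log_bucket(distance))
```

With these assumed defaults, every distance beyond `max_position` falls into the last bucket, so the number of position embeddings stays bounded regardless of sequence length.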

12/29/2020

With DeBERTa 1.5B model, we surpass T5 11B model and human performance on SuperGLUE leaderboard. Code and model will be released soon. Please check out our paper for more details.

06/13/2020

We released the pre-trained models, source code, and fine-tuning scripts to reproduce some of the experimental results in the paper. You can follow similar scripts to apply DeBERTa to your own experiments or applications. Pre-training scripts will be released in the next step.

Introduction to DeBERTa

DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks.
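As a rough illustration of the disentangled attention described above, the sketch below computes one unnormalised attention score as the sum of a content-to-content, a content-to-position, and a position-to-content term, following the formulation in the DeBERTa paper as we read it. All names (`rel_index`, `disentangled_score`, `q_c`, `k_c`, `q_r`, `k_r`) are hypothetical, and plain Python lists stand in for tensors; the actual implementation is batched and multi-headed.

```python
import math


def rel_index(i: int, j: int, k: int) -> int:
    """Clipped relative distance from position i to position j, in [0, 2k)."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def disentangled_score(i, j, q_c, k_c, q_r, k_r, k):
    """Unnormalised attention score from token i to token j.

    q_c, k_c: per-token content query/key vectors.
    q_r, k_r: per-relative-position query/key vectors (2k entries each).
    """
    d = len(q_c[i])
    c2c = dot(q_c[i], k_c[j])                    # content -> content
    c2p = dot(q_c[i], k_r[rel_index(i, j, k)])   # content -> position
    p2c = dot(k_c[j], q_r[rel_index(j, i, k)])   # position -> content
    return (c2c + c2p + p2c) / math.sqrt(3 * d)  # scaled by sqrt(3d)
```

A softmax over `j` of these scores would then give the attention weights for token `i`.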

Pre-trained Models

Our pre-trained models are packaged into zipped files. You can download them from our releases, or download an individual model via the links below:

| Model | Vocabulary(K) | Backbone Parameters(M) | Hidden Size | Layers | Note |
| --- | --- | --- | --- | --- | --- |
| V2-XXLarge¹ | 128 | 1320 | 1536 | 48 | 128K new SPM vocab |
| V2-XLarge | 128 | 710 | 1536 | 24 | 128K new SPM vocab |
| XLarge | 50 | 700 | 1024 | 48 | Same vocab as RoBERTa |
| Large | 50 | 350 | 1024 | 24 | Same vocab as RoBERTa |
| Base | 50 | 100 | 768 | 12 | Same vocab as RoBERTa |
| V2-XXLarge-MNLI | 128 | 1320 | 1536 | 48 | Fine-tuned with MNLI |
| V2-XLarge-MNLI | 128 | 710 | 1536 | 24 | Fine-tuned with MNLI |
| XLarge-MNLI | 50 | 700 | 1024 | 48 | Fine-tuned with MNLI |
| Large-MNLI | 50 | 350 | 1024 | 24 | Fine-tuned with MNLI |
| Base-MNLI | 50 | 86 | 768 | 12 | Fine-tuned with MNLI |
| DeBERTa-V3-Large² | 128 | 304 | 1024 | 24 | 128K new SPM vocab |
| DeBERTa-V3-Base² | 128 | 86 | 768 | 12 | 128K new SPM vocab |
| DeBERTa-V3-Small² | 128 | 44 | 768 | 6 | 128K new SPM vocab |
| DeBERTa-V3-XSmall² | 128 | 22 | 384 | 12 | 128K new SPM vocab |
| mDeBERTa-V3-Base² | 250 | 86 | 768 | 12 | 250K new SPM vocab, multi-lingual model with 102 languages |

Note

  • ¹ This is the model (89.9) that surpassed T5 11B (89.3) and human performance (89.8) on SuperGLUE for the first time. It uses the 128K new SPM vocab.
  • ² These V3 models are DeBERTa models pre-trained with an ELECTRA-style objective plus gradient-disentangled embedding sharing, which significantly improves model efficiency.

Try the model

Read our documentation

Requirements

  • Linux system, e.g. Ubuntu 18.04LTS
  • CUDA 10.0
  • pytorch 1.3.0
  • python 3.6
  • bash shell 4.0
  • curl
  • docker (optional)
  • nvidia-docker2 (optional)

There are several ways to try our code:

Use docker

Docker is the recommended way to run the code, as we have already built every dependency into our docker image bagai/deberta. You can follow the official docker site to install docker on your machine.

To run with docker, make sure your system fulfills the requirements in the list above. Here are the steps to try the GLUE experiments: pull the code, run ./run_docker.sh, and then run the bash commands under /DeBERTa/experiments/glue/.

Use pip

Pull the code and run pip3 install -r requirements.txt in the root directory of the code, then enter the experiments/glue/ folder and try the bash commands there for the GLUE experiments.

Install as a pip package

pip install deberta

Use DeBERTa in existing code

# To apply DeBERTa to your existing code, you need to make two changes:
# 1. change your model to consume DeBERTa as the encoder
from DeBERTa import deberta
import torch
class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # Your existing model code
    self.deberta = deberta.DeBERTa(pre_trained='base') # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
    # Your existing model code
    # do initialization as before
    # 
    self.deberta.apply_state() # Apply the pre-trained model of DeBERTa at the end of the constructor
    #
  def forward(self, input_ids):
    # The inputs to DeBERTa forward are
    # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
    # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token types indices selected in [0, 1]. 
    #    Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
    # `attention_mask`: an optional parameter for input mask or attention mask. 
    #   - If it's an input mask, then it will be torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1]. 
    #      It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch. 
    #      It's the mask that we typically use for attention when a batch has varying length sentences.
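    #   - If it's an attention mask then it will be torch.LongTensor of shape [batch_size, sequence_length, sequence_length]. 
    #      In this case, it's a mask indicating which tokens in the sequence should be attended to by other tokens in the sequence. 
    # `output_all_encoded_layers`: whether to output results of all encoder layers, default, True
    # Call the encoder created in __init__ (`deberta.bert` is not defined in this snippet).
    # Assumption: the encoder returns a sequence of per-layer outputs; check the return type of your installed version.
    encoding = self.deberta(input_ids)[-1]
    return encoding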

# 2. Replace your tokenizer with the tokenizer built into DeBERTa
import torch
from DeBERTa import deberta
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
# We apply the same schema of special tokens as BERT, e.g. [CLS], [SEP], [MASK]
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence
tokens = tokens[:max_seq_len -2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1]*len(input_ids)
# padding
paddings = max_seq_len-len(input_ids)
input_ids = input_ids + [0]*paddings
input_mask = input_mask + [0]*paddings
features = {
# Use long (int64) tensors, matching the torch.LongTensor inputs described above
'input_ids': torch.tensor(input_ids, dtype=torch.long),
'input_mask': torch.tensor(input_mask, dtype=torch.long)
}
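The padding steps in the tokenizer snippet above can be wrapped in a small helper so they are easy to reuse and test. The following is a minimal sketch that does not depend on the DeBERTa package; the name `pad_features` and the default `pad_id=0` are assumptions that mirror the snippet rather than part of the library.

```python
from typing import Dict, List


def pad_features(input_ids: List[int], max_seq_len: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Right-pad token ids to `max_seq_len` and build the matching input mask.

    Truncation is expected to happen before the special tokens are added
    (as in the snippet above), so an over-long input is treated as an error
    here instead of silently cutting off the final [SEP] token.
    """
    if len(input_ids) > max_seq_len:
        raise ValueError("truncate the tokens before adding special tokens")
    padding = max_seq_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "input_mask": [1] * len(input_ids) + [0] * padding,  # 1 = real token, 0 = padding
    }
```

The returned lists can be converted to tensors in the same way as the `features` dictionary above.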

Run DeBERTa experiments from command line

For glue tasks,

  1. Get the data
cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh  $cache_dir/glue_tasks
  2. Run task
task=STS-B 
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train  \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128

Notes

    1. By default we cache the pre-trained model and tokenizer at $HOME/.~DeBERTa. You may need to clean it if a download fails unexpectedly.
    2. You can also try our models with HF Transformers. When you try the XXLarge model, you need to specify the --sharded_ddp argument. Please check our XXLarge model card for more details.
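For the HF Transformers route mentioned in the notes above, a minimal sketch might look like the following. It assumes `transformers`, `torch`, and `sentencepiece` are installed and that the model id `microsoft/deberta-v3-base` can be downloaded from the hub; the helper name `encode_with_hf` is hypothetical.

```python
def encode_with_hf(text, model_name="microsoft/deberta-v3-base"):
    """Return the last-layer hidden states for `text` using HF Transformers."""
    # Imported lazily so that defining this helper does not itself require
    # `torch` or `transformers` to be installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)  # needs sentencepiece for V2/V3 models
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    return outputs.last_hidden_state  # shape: [1, sequence_length, hidden_size]
```

Calling `encode_with_hf("Example input text of DeBERTa")` downloads the model on first use, so it needs network access and enough disk space for the checkpoint.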

Experiments

Our fine-tuning experiments are carried out on half a DGX-2 node with 8x32G V100 GPU cards. The results may vary due to different GPU models, drivers, CUDA SDK versions, use of FP16 or FP32, and random seeds. We report our numbers based on multiple runs with different random seeds. Here are the results from the Large model:

| Task | Command | Results | Running Time (8x32G V100 GPUs) |
| --- | --- | --- | --- |
| MNLI xxlarge v2 | experiments/glue/mnli.sh xxlarge-v2 | 91.7/91.9 +/-0.1 | 4h |
| MNLI xlarge v2 | experiments/glue/mnli.sh xlarge-v2 | 91.7/91.6 +/-0.1 | 2.5h |
| MNLI xlarge | experiments/glue/mnli.sh xlarge | 91.5/91.2 +/-0.1 | 2.5h |
| MNLI large | experiments/glue/mnli.sh large | 91.3/91.1 +/-0.1 | 2.5h |
| QQP large | experiments/glue/qqp.sh large | 92.3 +/-0.1 | 6h |
| QNLI large | experiments/glue/qnli.sh large | 95.3 +/-0.2 | 2h |
| MRPC large | experiments/glue/mrpc.sh large | 91.9 +/-0.5 | 0.5h |
| RTE large | experiments/glue/rte.sh large | 86.6 +/-1.0 | 0.5h |
| SST-2 large | experiments/glue/sst2.sh large | 96.7 +/-0.3 | 1h |
| STS-b large | experiments/glue/Stsb.sh large | 92.5 +/-0.3 | 0.5h |
| CoLA large | experiments/glue/cola.sh | 70.5 +/-1.0 | 0.5h |

And here are the results from the Base model:

| Task | Command | Results | Running Time (8x32G V100 GPUs) |
| --- | --- | --- | --- |
| MNLI base | experiments/glue/mnli.sh base | 88.8/88.5 +/-0.2 | 1.5h |
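Since the numbers above are reported over multiple runs with different random seeds, a summary in the same "mean +/-spread" form can be produced with a few lines of Python. This is only a sketch: the README does not say which spread statistic is used, so the sample standard deviation here is an assumption, and the scores in the usage line are made-up placeholders rather than real results.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


def format_result(scores):
    """Format scores in the 'mean +/-spread' style used in the tables."""
    mean, spread = summarize_runs(scores)
    return f"{mean:.1f} +/-{spread:.1f}"


if __name__ == "__main__":
    # Placeholder scores from three hypothetical seeds, not measured results.
    print(format_result([92.4, 92.6, 92.5]))
```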

Fine-tuning on NLU tasks

We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

| Model | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI-m/mm (Acc) | SST-2 (Acc) | QNLI (Acc) | CoLA (MCC) | RTE (Acc) | MRPC (Acc/F1) | QQP (Acc/F1) | STS-B (P/S) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Large | 90.9/84.1 | 81.8/79.0 | 86.6/- | 93.2 | 92.3 | 60.6 | 70.4 | 88.0/- | 91.3/- | 90.0/- |
| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 | 90.2/- | 96.4 | 93.9 | 68.0 | 86.6 | 90.9/- | 92.2/- | 92.4/- |
| XLNet-Large | 95.1/89.7 | 90.6/87.9 | 90.8/- | 97.0 | 94.9 | 69.0 | 85.9 | 90.8/- | 92.3/- | 92.5/- |
| DeBERTa-Large¹ | 95.5/90.1 | 90.7/88.0 | 91.3/91.1 | 96.5 | 95.3 | 69.5 | 91.0 | 92.6/94.6 | 92.3/- | 92.8/92.5 |
| DeBERTa-XLarge¹ | -/- | -/- | 91.5/91.2 | 97.0 | - | - | 93.1 | 92.1/94.3 | - | 92.9/92.7 |
| DeBERTa-V2-XLarge¹ | 95.8/90.8 | 91.4/88.9 | 91.7/91.6 | 97.5 | 95.8 | 71.1 | 93.9 | 92.0/94.2 | 92.3/89.8 | 92.9/92.9 |
| DeBERTa-V2-XXLarge¹,² | 96.1/91.4 | 92.2/89.7 | 91.7/91.9 | 97.2 | 96.0 | 72.0 | 93.5 | 93.1/94.9 | 92.7/90.3 | 93.2/93.1 |
| DeBERTa-V3-Large | -/- | 91.5/89.0 | 91.8/91.9 | 96.9 | 96.0 | 75.3 | 92.7 | 92.2/- | 93.0/- | 93.0/- |
| DeBERTa-V3-Base | -/- | 88.4/85.4 | 90.6/90.7 | - | - | - | - | - | - | - |
| DeBERTa-V3-Small | -/- | 82.9/80.4 | 88.3/87.7 | - | - | - | - | - | - | - |
| DeBERTa-V3-XSmall | -/- | 84.8/82.0 | 88.1/88.3 | - | - | - | - | - | - | - |

Fine-tuning on XNLI

We present the dev results on XNLI in the zero-shot cross-lingual transfer setting, i.e. training with English data only and testing on other languages.

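| Model | avg | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XLM-R-base | 76.2 | 85.8 | 79.7 | 80.7 | 78.7 | 77.5 | 79.6 | 78.1 | 74.2 | 73.8 | 76.5 | 74.6 | 76.7 | 72.4 | 66.5 | 68.3 |
| mDeBERTa-V3-Base | 79.8+/-0.2 | 88.2 | 82.6 | 84.4 | 82.7 | 82.3 | 82.4 | 80.8 | 79.5 | 78.5 | 78.1 | 76.4 | 79.5 | 75.9 | 73.9 | 72.4 |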
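The `avg` column appears to be the unweighted mean over the 15 languages. The short check below recomputes it from the per-language scores in the table; the dictionary layout and the `average` helper are just one way to arrange the numbers for illustration.

```python
# Per-language XNLI dev accuracies copied from the table above, in the order
# en, fr, es, de, el, bg, ru, tr, ar, vi, th, zh, hi, sw, ur.
xnli = {
    "XLM-R-base": [85.8, 79.7, 80.7, 78.7, 77.5, 79.6, 78.1, 74.2,
                   73.8, 76.5, 74.6, 76.7, 72.4, 66.5, 68.3],
    "mDeBERTa-V3-Base": [88.2, 82.6, 84.4, 82.7, 82.3, 82.4, 80.8, 79.5,
                         78.5, 78.1, 76.4, 79.5, 75.9, 73.9, 72.4],
}


def average(scores):
    """Unweighted mean over languages."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for name, scores in xnli.items():
        print(f"{name}: {average(scores):.1f}")
```

Both recomputed means round to the values reported in the `avg` column (76.2 and 79.8).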

Notes.

Pre-training with MLM and RTD objectives

To pre-train DeBERTa with MLM and RTD objectives, please check experiments/language_models
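To make the two objectives concrete, here is a toy sketch of how their training targets differ. It is not the code in experiments/language_models; the function names and the `[MASK]` handling are assumptions for illustration. For MLM, selected tokens are hidden and the model predicts them. For RTD (the ELECTRA-style objective), some tokens are swapped by a generator and the model labels each position as original or replaced.

```python
def mask_for_mlm(tokens, positions, mask_token="[MASK]"):
    """MLM: hide the tokens at `positions`; return (masked input, targets)."""
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]  # the model must recover this token
        masked[p] = mask_token
    return masked, targets


def rtd_labels(original, corrupted):
    """RTD: 1 where a token was replaced, 0 where it is unchanged."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


if __name__ == "__main__":
    sentence = ["the", "chef", "cooked", "the", "meal"]
    print(mask_for_mlm(sentence, [1]))
    print(rtd_labels(sentence, ["the", "chef", "ate", "the", "meal"]))
```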

Contacts

Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

Citation

@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing}, 
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@inproceedings{
he2021deberta,
title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}
Comments
  • mDeBERTa on HuggingFace hub does not seem to work

    I really like the DeBERTa-v3 models and the monolingual models work very well for me. Weirdly enough, the multilingual model uploaded on the huggingface hub does not seem to work. I have code for training multilingual models on XNLI, and the training normally works well (e.g. no issue with microsoft/Multilingual-MiniLM-L12-H384), but when I apply the exact same code to mDeBERTa, the model does not seem to learn anything. I don't get an error message, but the training results look like this: [Screenshot 2021-12-04 at 10 43 58]

    I've double checked by running the exact same code on multilingual-minilm and the training works, which makes me think that it's not an issue in the code (wrongly formatting the input data or something like that), but something went wrong when uploading mDeBERTa to the huggingface hub? Accuracy of exactly random 0.3333, 0 training loss at epoch 2 and NaN validation loss maybe indicates that the data is running through the model, but some parameters are not updating or something like that?

    My environment is google colab; Transformers==4.12.5

    opened by MoritzLaurer 6
  • Can't load DeBERTa-v3 tokenizer

    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    

    Gives me an error ValueError: This tokenizer cannot be instantiated. Please make sure you have sentencepiece installed in order to use this tokenizer. But sentencepiece is already installed

    Also tried

    !pip install deberta
    from DeBERTa import deberta
    vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base-v3')
    tokenizer = deberta.tokenizers[vocab_type](vocab_path)
    

    this gives me TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

    Please help, how can I use the tokenizer for deberta-base-v3?

    opened by maiiabocharova 4
  • Issues loading 1.5B model in huggingface and in deberta package

    Hello,

    It seems like some of the weights were renamed/shaped in the V2 model releases and I couldn't quite figure out how to map them to the old structure

    # it seemed like 
    pos_q_proj => query_proj
    v_bias => value_proj
    

    but I couldn't match

    deberta.encoder.layer.44.attention.self.key_proj.weight', 'deberta.encoder.layer.44.attention.self.key_proj.bias
    =>
    deberta.encoder.layer.44.attention.self.q_bias', 'deberta.encoder.layer.44.attention.self.value_proj', 'deberta.encoder.layer.44.attention.self.in_proj.weight', 'deberta.encoder.layer.44.attention.self.pos_proj.weight
    

    That was for huggingface, but I couldn't figure it out in this repo either.

    Could someone upload the v2 model file?

    opened by chessgecko 4
  • HTTP Error 403: Forbidden when downloading glue_tasks

    This error occurs when running the setup_glue_data() function inside any Glue Task script, like qqp_large.sh. The error resides on the original script download_glue_data.py, probably the token is no longer valid, but this directly affects this repo making it not possible to reproduce the results with the Glue Tasks.

    Error log:

    Downloading and extracting QQP...
    Traceback (most recent call last):
      File "<stdin>", line 172, in <module>
      File "<stdin>", line 168, in main
      File "<stdin>", line 57, in download_and_extract
      File "/usr/lib/python3.6/urllib/request.py", line 248, in urlretrieve
        with contextlib.closing(urlopen(url, data)) as fp:
      File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.6/urllib/request.py", line 532, in open
        response = meth(req, response)
      File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
        'http', request, response, code, msg, hdrs)
      File "/usr/lib/python3.6/urllib/request.py", line 570, in error
        return self._call_chain(*args)
      File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
        result = func(*args)
      File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 403: Forbidden
    
    opened by huberemanuel 4
  • Training mDeBERTaV3 with Simple Transformers not successful: macro, micro f1: 0.003, 0.035

    Hello,

    I would like to fine-tune mDeBERTaV3 for the genre classification task and compare it to XLM-RoBERTa and some other similar models, but the training gives very low results (macro, micro f1: 0.003, 0.035; high running loss: 3.0456) and the confusion matrix shows that the model predicts one class for all instances (a different class in different runs). Training other models (XLM-RoBERTa, SloBERTa, BERTić etc.) with the same setting (only the model type and model name are changed for each model, otherwise the code and dataset are the same) works without any problems.

    Here are the hyperparameters:

    from simpletransformers.classification import ClassificationModel
    
    model_args ={"overwrite_output_dir": True,
                 "num_train_epochs": 90,
                 "labels_list": LABELS,
                 "learning_rate": 1e-5,
                 "train_batch_size": 32,
                 "no_cache": True,
                 "no_save": True,
                 "max_seq_length": 300,
                 "save_steps": -1
                 }
    
    debertav3_model = ClassificationModel(
            "debertav2", "microsoft/mdeberta-v3-base",
            num_labels=21,
            use_cuda=True,
            args=model_args
        )
    

    The training is performed without being stopped by an error, but there occur some warning messages that might have something to do with the low performance:

    1. When loading the pre-trained model:
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    

    This makes me think that it might be the problem with the model type, but changing it to "deberta-v2", "debertav3", or "deberta" results in an error.

    2. When training:
    /opt/conda/lib/python3.7/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:1313: UserWarning: This overload of nonzero is deprecated:
    	nonzero()
    Consider using one of the following signatures instead:
    	nonzero(*, bool as_tuple) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
      label_index = (labels >= 0).nonzero()
    /opt/conda/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
      "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
    

    I'm working on Kaggle, and using the following versions: pytorch>=1.6, cudatoolkit=11.0, simpletransformers==0.63.3, torch==1.6.0+cu101

    Thank you very much in advance for your help!

    opened by TajaKuzman 3
  • MLM Pre-training Code Version

    Hello ! Thank you for sharing a great piece of work.

    I was wondering whether the MLM pre-training code is for training DeBERTa v3 or v2 (or v1)?

    Regards

    opened by robinsongh381 3
  • Pre-trained models are not accessible

    Thanks for sharing the repo. However, I could not access the pre-trained base and large models from the paths below.

    https://github.com/microsoft/DeBERTa/releases/download/v0.1/base.zip https://github.com/microsoft/DeBERTa/releases/download/v0.1/large.zip

    opened by ashissamal 3
  • Why does the size of DeBERTaV3 double on disk after finetuning?

    On HF, deberta-v3-large is 800mb: https://huggingface.co/microsoft/deberta-v3-large

    But after even a few steps of MLM training the saved model is 1.6gb: https://colab.research.google.com/drive/1PG4PKYnye_F1We2i7VccQ4nYn_XTHhKP?usp=sharing

    This seems true of many other finetuned versions of DeBERTaV3 on HF (for both base and large size). It also doesn't seem specific to MLM: https://huggingface.co/navteca/nli-deberta-v3-large https://huggingface.co/cross-encoder/nli-deberta-v3-base/tree/main

    Any idea why this is? Is it something to do with V3 itself? And does anyone know if the model size can be reduced again after training?

    Thanks!

    opened by nadahlberg 2
  • the results of debertav3 small on the mnli task

    The results of debertav3 small on the mnli validation set, in the paper, reported as 88.2/87.9, look different from those reported in open source: https://huggingface.co/mrm8488/deberta-v3-small-finetuned-mnli (reported as 87.46)

    [image]
    opened by nbcc 2
  • DeBERTa V3 Fine-Tuning

    Thank you very, very much for DeBERTa! I'm using it as a 0-shot relation extractor, and it works extremely well. I was wondering if you're planning to release the V3 models fine-tuned on MNLI like previous versions on Hugging Face. Thank you again!

    opened by stevemarin 2
  • "deberta-v2-xxlarge"-Model not working!

    I do:

    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v2-xxlarge")
    model = AutoModel.from_pretrained("microsoft/deberta-v2-xxlarge")

    But always the same error occurs:

    config_class = CONFIG_MAPPING[config_dict["model_type"]]
    KeyError: 'deberta-v2'

    Appreciate your help!

    opened by kinimod23 2
  • Fix: a few typos as I read through the README.md

    Hi,

    Awesome repo! Just started looking into DeBERTa-based models and as I was reading through the README I noticed a few typos that could be fixed. Let me know what you think.

    Hope this helps :)

    opened by cpcdoy 0
  • why vocab.txt and tokenizer.json not in pretrained model in huggingface ??

    https://huggingface.co/microsoft/deberta-v2-xlarge/tree/main

    If I run : tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v2-xlarge')

    get bug: ValueError: Couldn't instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

    opened by XuJianzhi 1
  • AssertionError: [] in Google Colab

    I am trying to use deberta in google colab, getting this:

    AssertionError Traceback (most recent call last) in
    ----> 1 m = deberta.DeBERTa(pre_trained="large")

    2 frames
    /usr/local/lib/python3.7/dist-packages/DeBERTa/deberta/deberta.py in key_match(key, s)
        141 def key_match(key, s):
        142   c = [k for k in s if key in k]
    --> 143   assert len(c)==1, c
        144   return c[0]
        145 current = self.state_dict()

    AssertionError: []

    Any ideas?

    opened by yupesh 0
Releases(v0.1.8)
Owner
Microsoft
Open source projects and samples from Microsoft