Overview

BanglaBERT

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

Table of Contents

  • Models
  • Datasets
  • Setup
  • Training & Evaluation
  • Benchmarks
  • Acknowledgements
  • License
  • Citation

Models

We are releasing a slightly better checkpoint than the one reported in the paper, pretrained on 27.5 GB of data (including more code-switched and code-mixed text) and pretrained further for 2.5M steps. The pretrained model checkpoint is available here. To use this model for the supported downstream tasks in this repository, see Training & Evaluation.

Note: This model was pretrained using a specific normalization pipeline, available here. All finetuning scripts in this repository use this normalization by default. If you adapt the pretrained model to a different task, make sure the text is normalized with this pipeline before tokenization to get the best results. A basic example is available on the model page.
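
A minimal usage sketch, mirroring the example reported in the issues below (the normalizer package and the csebuetnlp/banglabert checkpoint are the ones referenced in this repository):

# pip install transformers
# pip install git+https://github.com/csebuetnlp/normalizer
from transformers import AutoModelForPreTraining, AutoTokenizer
from normalizer import normalize  # the normalization pipeline mentioned above

model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

text = "আমি বিদ্যালয়ে যাই।"
text = normalize(text)              # normalize before tokenizing
tokens = tokenizer.tokenize(text)
print(tokens)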

Datasets

We are also releasing the Bangla Natural Language Inference (NLI) dataset introduced in the paper. The dataset can be found here.

Setup

To install the necessary requirements, use the following snippet:

$ git clone https://github.com/csebuetnlp/banglabert
$ cd banglabert/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash setup.sh 
  • Use the newly created environment for running the scripts in this repository.

Training & Evaluation

To use the pretrained model for finetuning / inference on different downstream tasks, see the following sections (a minimal loading sketch follows this list):

  • Sequence Classification.
    • For single sequence classification such as
      • Document classification
      • Sentiment classification
      • Emotion classification etc.
    • For double sequence classification such as
      • Natural Language Inference (NLI)
      • Paraphrase detection etc.
  • Token Classification.
    • For token tagging / classification tasks such as
      • Named Entity Recognition (NER)
      • Parts of Speech Tagging (PoS) etc.
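
For orientation, here is a rough sketch of how these task groups map onto standard transformers head classes. The finetuning scripts in this repository handle this wiring themselves; the num_labels values below are illustrative assumptions, not values taken from the repository.

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,  # single- and double-sequence tasks
    AutoModelForTokenClassification,     # token tagging tasks (NER, PoS)
)

checkpoint = "csebuetnlp/banglabert"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Double-sequence example: a 3-way NLI head (num_labels=3 is an assumption).
nli_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Token-tagging example: an NER head (num_labels depends on your tag set).
ner_model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=7)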

Benchmarks

Model                         SC         EC      DC         NER            NLI
Metric                        Accuracy   F1*     Accuracy   F1 (Entity)*   Accuracy
mBERT                         83.39      56.02   98.64      67.40          75.40
XLM-R                         89.49      66.70   98.71      70.63          76.87
sagorsarker/bangla-bert-base  87.30      61.51   98.79      70.97          70.48
monsoon-nlp/bangla-electra    73.54      34.55   97.64      52.57          63.48
BanglaBERT                    92.18      74.27   99.07      72.18          82.94

* - Weighted Average

The benchmarking datasets correspond to the following tasks: Sentiment Classification (SC), Emotion Classification (EC), Document Classification (DC), Named Entity Recognition (NER), and Natural Language Inference (NLI).

Acknowledgements

We would like to thank Intelligent Machines and the Google TFRC Program for providing cloud support for pretraining the models.

License

Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).


Citation

If you use any of the datasets, models or code modules, please cite the following paper:

@article{bhattacharjee2021banglabert,
  author    = {Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
  title     = {BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
  journal   = {CoRR},
  volume    = {abs/2101.00204},
  year      = {2021},
  url       = {https://arxiv.org/abs/2101.00204},
  eprinttype = {arXiv},
  eprint    = {2101.00204}
}
Comments
  • BanglaBERT is not loading due to a problem in the config.json file

    Hi, I have tried to use your BanglaBERT model from Hugging Face with the code below. The error message says there is a problem in your config.json file. Can you please fix this issue?

    
    !pip install git+https://github.com/csebuetnlp/normalizer
    from transformers import AutoModelForPreTraining, AutoTokenizer
    from normalizer import normalize
    import torch
    
    model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
    tokenizer_bbert = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
    
    text = "আমি বিদ্যালয়ে যাই ।"
    text = normalize(text)
    
    tokenizer_bbert.tokenize(text)
    
    

    (screenshot of the error message omitted)

    opened by MusfiqDehan 4
  • Error while finetuning in Google Colab using GPU

    Hi,

    I want to finetune BanglaBERT for sequence classification.

    • I face the following error when running on Google Colab with a GPU.
    • I do not face this error when running on Google Colab with a CPU.

    This error occurred while running the following command (the sequence classification example from GitHub):

    python ./sequence_classification/sequence_classification.py --overwrite_output_dir --model_name_or_path "csebuetnlp/banglabert" --dataset_dir "./sequence_classification/sample_inputs/single_sequence/jsonl" --output_dir "./sequence_classification/outputs/" --learning_rate=2e-5 --warmup_ratio 0.1 --gradient_accumulation_steps 2 --weight_decay 0.1 --lr_scheduler_type "linear" --per_device_train_batch_size=16 --per_device_eval_batch_size=16 --max_seq_length 512 --logging_strategy "epoch" --save_strategy "epoch" --evaluation_strategy "epoch" --num_train_epochs=3 --do_train --do_eval
    

    Error Traceback:

    05/08/2022 09:24:21 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
    05/08/2022 09:24:21 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
    _n_gpu=1,
    adafactor=False,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    dataloader_drop_last=False,
    dataloader_num_workers=0,
    dataloader_pin_memory=True,
    ddp_find_unused_parameters=None,
    debug=[],
    deepspeed=None,
    disable_tqdm=False,
    do_eval=True,
    do_predict=False,
    do_train=True,
    eval_accumulation_steps=None,
    eval_steps=None,
    evaluation_strategy=IntervalStrategy.EPOCH,
    fp16=False,
    fp16_backend=auto,
    fp16_full_eval=False,
    fp16_opt_level=O1,
    gradient_accumulation_steps=2,
    greater_is_better=None,
    group_by_length=False,
    ignore_data_skip=False,
    label_names=None,
    label_smoothing_factor=0.0,
    learning_rate=2e-05,
    length_column_name=length,
    load_best_model_at_end=False,
    local_rank=-1,
    log_level=-1,
    log_level_replica=-1,
    log_on_each_node=True,
    logging_dir=./sequence_classification/outputs/runs/May08_09-24-21_0da7ed02e26d,
    logging_first_step=False,
    logging_steps=500,
    logging_strategy=IntervalStrategy.EPOCH,
    lr_scheduler_type=SchedulerType.LINEAR,
    max_grad_norm=1.0,
    max_steps=-1,
    metric_for_best_model=None,
    mp_parameters=,
    no_cuda=False,
    num_train_epochs=3.0,
    output_dir=./sequence_classification/outputs/,
    overwrite_output_dir=True,
    past_index=-1,
    per_device_eval_batch_size=16,
    per_device_train_batch_size=16,
    prediction_loss_only=False,
    push_to_hub=False,
    push_to_hub_model_id=outputs,
    push_to_hub_organization=None,
    push_to_hub_token=None,
    remove_unused_columns=True,
    report_to=['tensorboard'],
    resume_from_checkpoint=None,
    run_name=./sequence_classification/outputs/,
    save_on_each_node=False,
    save_steps=500,
    save_strategy=IntervalStrategy.EPOCH,
    save_total_limit=None,
    seed=42,
    sharded_ddp=[],
    skip_memory_metrics=True,
    tpu_metrics_debug=False,
    tpu_num_cores=None,
    use_legacy_prediction_loop=False,
    warmup_ratio=0.1,
    warmup_steps=0,
    weight_decay=0.1,
    )
    05/08/2022 09:24:21 - WARNING - datasets.builder - Using custom data configuration default-1e09c73b0f004fd6
    05/08/2022 09:24:21 - INFO - datasets.builder - Overwrite dataset info from restored data version.
    05/08/2022 09:24:21 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0
    05/08/2022 09:24:21 - WARNING - datasets.builder - Reusing dataset json (/root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0)
    05/08/2022 09:24:21 - INFO - datasets.info - Loading Dataset info from /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0
    100% 3/3 [00:00<00:00, 886.31it/s]
    [INFO|configuration_utils.py:561] 2022-05-08 09:24:22,163 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
    [INFO|configuration_utils.py:598] 2022-05-08 09:24:22,164 >> Model config ElectraConfig {
      "architectures": [
        "ElectraForPreTraining"
      ],
      "attention_probs_dropout_prob": 0.1,
      "classifier_dropout": null,
      "embedding_size": 768,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 512,
      "model_type": "electra",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "pad_token_id": 0,
      "position_embedding_type": "absolute",
      "summary_activation": "gelu",
      "summary_last_dropout": 0.1,
      "summary_type": "first",
      "summary_use_proj": true,
      "transformers_version": "4.11.0.dev0",
      "type_vocab_size": 2,
      "vocab_size": 32000
    }
    
    [INFO|configuration_utils.py:561] 2022-05-08 09:24:23,954 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
    [INFO|configuration_utils.py:598] 2022-05-08 09:24:23,955 >> Model config ElectraConfig {
      "architectures": [
        "ElectraForPreTraining"
      ],
      "attention_probs_dropout_prob": 0.1,
      "classifier_dropout": null,
      "embedding_size": 768,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 512,
      "model_type": "electra",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "pad_token_id": 0,
      "position_embedding_type": "absolute",
      "summary_activation": "gelu",
      "summary_last_dropout": 0.1,
      "summary_type": "first",
      "summary_use_proj": true,
      "transformers_version": "4.11.0.dev0",
      "type_vocab_size": 2,
      "vocab_size": 32000
    }
    
    [INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/65e95b847336b6bf69b37fdb8682a97e822799adcd9745dcf9bf44cfe4db1b9a.8f92ca2cf7e2eaa550b10c40331ae9bf0f2e40abe3b549f66a3d7f13bfc6de47
    [INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/added_tokens.json from cache at None
    [INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/special_tokens_map.json from cache at /root/.cache/huggingface/transformers/7820dfc553e8dfb8a1e82042b7d0d691c7a7cd1e30ed2974218f696e81c5f3b1.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
    [INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/76fa87a0ec9c34c9b15732bf7e06bced447feff46287b8e7d246a55d301784d7.b4f59cefeba4296760d2cf1037142788b96f2be40230bf6393d2fba714562485
    [INFO|tokenization_utils_base.py:1739] 2022-05-08 09:24:29,230 >> loading file https://huggingface.co/csebuetnlp/banglabert/resolve/main/tokenizer.json from cache at None
    [INFO|configuration_utils.py:561] 2022-05-08 09:24:30,126 >> loading configuration file https://huggingface.co/csebuetnlp/banglabert/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/60928dc4b87f5881692890e6541e6538f91588d2ea40cbbbdc04cfb2cb83a6b1.2388211ba94f448fcf40aef3c9526142a8c2f2a8fb4fce8a3801462f51b2bab5
    [INFO|configuration_utils.py:598] 2022-05-08 09:24:30,126 >> Model config ElectraConfig {
      "architectures": [
        "ElectraForPreTraining"
      ],
      "attention_probs_dropout_prob": 0.1,
      "classifier_dropout": null,
      "embedding_size": 768,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 512,
      "model_type": "electra",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "pad_token_id": 0,
      "position_embedding_type": "absolute",
      "summary_activation": "gelu",
      "summary_last_dropout": 0.1,
      "summary_type": "first",
      "summary_use_proj": true,
      "transformers_version": "4.11.0.dev0",
      "type_vocab_size": 2,
      "vocab_size": 32000
    }
    
    [INFO|modeling_utils.py:1279] 2022-05-08 09:24:31,075 >> loading weights file https://huggingface.co/csebuetnlp/banglabert/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/913ea71768a80ccdde3a9ab9a88cf2a93f37a52008896997655d0f63b0d0743a.8aaedac281b72dbb5296319c53be5a4e4a52339eded3f68d49201e140e221615
    [WARNING|modeling_utils.py:1516] 2022-05-08 09:24:32,600 >> Some weights of the model checkpoint at csebuetnlp/banglabert were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
    - This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    - This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    [WARNING|modeling_utils.py:1527] 2022-05-08 09:24:32,600 >> Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at csebuetnlp/banglabert and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    05/08/2022 09:24:32 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-c8c752bb15628b86.arrow
    05/08/2022 09:24:32 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-1d7e8a13339dd538.arrow
    05/08/2022 09:24:32 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-5b734993f8fa5b18.arrow
    05/08/2022 09:24:33 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-ae957e77cc0e01d1.arrow
    05/08/2022 09:24:33 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-ad37b78f61cc4fc6.arrow
    05/08/2022 09:24:33 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-1e09c73b0f004fd6/0.0.0/cache-efbe758578e42091.arrow
    05/08/2022 09:24:33 - INFO - __main__ - Sample 0 of the training set: {'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [2, 4992, 10267, 784, 27147, 415, 830, 7761, 1333, 16, 983, 12484, 825, 5083, 2893, 426, 2636, 16493, 415, 815, 2068, 795, 205, 3], 'label': 0, 'sentence1': 'যেই মাদারির পোলারা এই কাজটি করেছে, সেই সালারা অবৈধ জারপ সন্তান ছারা আর কিছুই না।', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}.
    05/08/2022 09:24:33 - INFO - __main__ - Sample 3 of the training set: {'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'input_ids': [2, 10634, 5452, 817, 972, 6037, 3], 'label': 0, 'sentence1': 'মুসা কপা\u200cলে কি আ\u200cছে জা\u200cনিনা', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0]}.
    05/08/2022 09:24:33 - INFO - __main__ - Sample 1 of the training set: {'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [2, 2157, 18812, 16332, 12062, 16135, 1292, 3], 'label': 0, 'sentence1': 'ভারতের কুখ্যাত ষড়যন্ত্রের মুখোশ উন্মোচন হলো', 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0]}.
    05/08/2022 09:24:35 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/accuracy.py at /root/.cache/huggingface/modules/datasets_modules/metrics/accuracy
    05/08/2022 09:24:35 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/accuracy.py at /root/.cache/huggingface/modules/datasets_modules/metrics/accuracy/6dba4616f6b2bbd19659d1db3a48cc001c8f13a10cdc73a2641a55f7a60b7b5b
    05/08/2022 09:24:35 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/accuracy.py to /root/.cache/huggingface/modules/datasets_modules/metrics/accuracy/6dba4616f6b2bbd19659d1db3a48cc001c8f13a10cdc73a2641a55f7a60b7b5b/accuracy.py
    05/08/2022 09:24:35 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/dataset_infos.json
    05/08/2022 09:24:35 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/accuracy/accuracy.py at /root/.cache/huggingface/modules/datasets_modules/metrics/accuracy/6dba4616f6b2bbd19659d1db3a48cc001c8f13a10cdc73a2641a55f7a60b7b5b/accuracy.json
    05/08/2022 09:24:36 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/precision.py at /root/.cache/huggingface/modules/datasets_modules/metrics/precision
    05/08/2022 09:24:36 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/precision.py at /root/.cache/huggingface/modules/datasets_modules/metrics/precision/94709a71c6fe37171ef49d3466fec24dee9a79846c9f176dff66a649e9811690
    05/08/2022 09:24:36 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/precision.py to /root/.cache/huggingface/modules/datasets_modules/metrics/precision/94709a71c6fe37171ef49d3466fec24dee9a79846c9f176dff66a649e9811690/precision.py
    05/08/2022 09:24:36 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/dataset_infos.json
    05/08/2022 09:24:36 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/precision/precision.py at /root/.cache/huggingface/modules/datasets_modules/metrics/precision/94709a71c6fe37171ef49d3466fec24dee9a79846c9f176dff66a649e9811690/precision.json
    05/08/2022 09:24:38 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/recall.py at /root/.cache/huggingface/modules/datasets_modules/metrics/recall
    05/08/2022 09:24:38 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/recall.py at /root/.cache/huggingface/modules/datasets_modules/metrics/recall/1e3b93e2bed42e1498e628f161d79ee019dd8e78d36985d3c7ecbc018adf35e8
    05/08/2022 09:24:38 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/recall.py to /root/.cache/huggingface/modules/datasets_modules/metrics/recall/1e3b93e2bed42e1498e628f161d79ee019dd8e78d36985d3c7ecbc018adf35e8/recall.py
    05/08/2022 09:24:38 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/dataset_infos.json
    05/08/2022 09:24:38 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/recall/recall.py at /root/.cache/huggingface/modules/datasets_modules/metrics/recall/1e3b93e2bed42e1498e628f161d79ee019dd8e78d36985d3c7ecbc018adf35e8/recall.json
    05/08/2022 09:24:39 - INFO - datasets.load - Found main folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/f1.py at /root/.cache/huggingface/modules/datasets_modules/metrics/f1
    05/08/2022 09:24:39 - INFO - datasets.load - Found specific version folder for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/f1.py at /root/.cache/huggingface/modules/datasets_modules/metrics/f1/6c86fddbf90432b9c43a8c38c62a0dd9de63bad2ef0a896f9ae20273d6d6f6e9
    05/08/2022 09:24:39 - INFO - datasets.load - Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/f1.py to /root/.cache/huggingface/modules/datasets_modules/metrics/f1/6c86fddbf90432b9c43a8c38c62a0dd9de63bad2ef0a896f9ae20273d6d6f6e9/f1.py
    05/08/2022 09:24:39 - INFO - datasets.load - Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/dataset_infos.json
    05/08/2022 09:24:39 - INFO - datasets.load - Found metadata file for metric https://raw.githubusercontent.com/huggingface/datasets/1.11.0/metrics/f1/f1.py at /root/.cache/huggingface/modules/datasets_modules/metrics/f1/6c86fddbf90432b9c43a8c38c62a0dd9de63bad2ef0a896f9ae20273d6d6f6e9/f1.json
    [INFO|trainer.py:521] 2022-05-08 09:24:43,888 >> The following columns in the training set  don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: sentence1.
    [INFO|trainer.py:1168] 2022-05-08 09:24:43,900 >> ***** Running training *****
    [INFO|trainer.py:1169] 2022-05-08 09:24:43,900 >>   Num examples = 4
    [INFO|trainer.py:1170] 2022-05-08 09:24:43,900 >>   Num Epochs = 3
    [INFO|trainer.py:1171] 2022-05-08 09:24:43,900 >>   Instantaneous batch size per device = 16
    [INFO|trainer.py:1172] 2022-05-08 09:24:43,900 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
    [INFO|trainer.py:1173] 2022-05-08 09:24:43,900 >>   Gradient Accumulation steps = 2
    [INFO|trainer.py:1174] 2022-05-08 09:24:43,900 >>   Total optimization steps = 3
      0% 0/3 [00:00<?, ?it/s]Traceback (most recent call last):
      File "./sequence_classification/sequence_classification.py", line 479, in <module>
        main()
      File "./sequence_classification/sequence_classification.py", line 426, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1284, in train
        tr_loss += self.training_step(model, inputs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1789, in training_step
        loss = self.compute_loss(model, inputs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1821, in compute_loss
        outputs = model(**inputs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/electra/modeling_electra.py", line 973, in forward
        return_dict,
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/electra/modeling_electra.py", line 879, in forward
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/transformers/models/electra/modeling_electra.py", line 206, in forward
        inputs_embeds = self.word_embeddings(input_ids)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 160, in forward
        self.norm_type, self.scale_grad_by_freq, self.sparse)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2183, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
      0% 0/3 [00:00<?, ?it/s]
    

    A probable solution from the PyTorch discussion forum, which I can't figure out how to apply: https://discuss.pytorch.org/t/code-that-loads-sgd-fails-to-load-adam-state-to-gpu/61783/3?u=shaibagon

    Thanks.
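
    For reference, the workaround described in that linked thread looks roughly like the sketch below. This is a generic pattern, not verified against this repository's training script, and it may not match the root cause here; the optimizer and device names are assumed to come from the surrounding training code.

    import torch

    # What the linked PyTorch thread describes: after loading an optimizer
    # state dict from a checkpoint, move its tensors onto the model's device
    # so that training does not mix cuda:0 and cpu tensors.
    def move_optimizer_state(optimizer, device):
        for state in optimizer.state.values():
            for key, value in state.items():
                if torch.is_tensor(value):
                    state[key] = value.to(device)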

    opened by fahshed 2
  • The websites the dataset was scraped from?

    As Alexa Web Rankings shut down in May 2022 (https://www.alexa.com/topsites/countries/BD), it is no longer possible to retrieve the names of the Bangladeshi websites used.

    It would be really useful if the names of the fifty Bangladeshi websites used to scrape the dataset could be released. It would help understand the nature of the dataset used to train the model and help in model interpretability experiments too.

    opened by imr555 1
  • Can I extract word embeddings using BanglaBERT?

    Hi, is it possible to extract/generate word embeddings using BanglaBERT? I have tokenized my Bangla sentence with BanglaBERT; now I want to generate word embeddings from the tokenized sentence.

    !pip install transformers
    !pip install git+https://github.com/csebuetnlp/normalizer
    
    from transformers import AutoModelForPreTraining, AutoTokenizer
    from normalizer import normalize
    import torch
    
    model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
    tokenizer_bbert = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
    
    
    text = 'দেশদ্রোহিতার মামলা স্বর্ণ মন্দিরের ভিতর ও বৈশাখী উৎসবের মিছিলে খলিস্তানপন্থী স্লোগান দেওয়ার জন্য কয়েকজন বিশ্ব যুবকের বিরুদ্ধে দেশদ্রোহিতার মামলা দায়ের করা হয়েছে ।'
    
    text = normalize(text)
    
    text = tokenizer_bbert.tokenize(text)
    
    print(text)
    
    # >>  ['দেশদ্রোহ', '##িতার', 'মামলা', 'স্বর্ণ', 'মন্দিরের', 'ভিতর', 'ও', 'বৈশাখী', 'উৎসবের', 'মিছিলে', 'খলি', '##স্তান', '##পন্থী', 'স্লোগান', 'দেওয়ার','জন্য', 'কয়েকজন', 'বিশ্ব', 'যুবকের', 'বিরুদ্ধে', 'দেশদ্রোহ', '##িতার', 'মামলা', 'দায়ের', 'করা', 'হয়েছে', '।']
    
    

    I have found out how to generate word embeddings using BERT; here is the link (https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958). Will it be the same for BanglaBERT / Bangla, or would it be better to use a different, Bangla-specific approach?

    Any kind of suggestion or advice will be helpful for me. Thanks in advance.
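
    One common approach, sketched below under the assumption that the generic transformers hidden-state API applies here as in the linked discussion (this is not an official recommendation from the BanglaBERT authors):

    import torch
    from transformers import AutoModel, AutoTokenizer
    from normalizer import normalize

    # Load the backbone without any task head; its hidden states can serve
    # as contextual (sub)word embeddings.
    model = AutoModel.from_pretrained("csebuetnlp/banglabert")
    tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

    text = normalize("আমি বিদ্যালয়ে যাই।")
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # One 768-dimensional vector per subword token (including [CLS]/[SEP]).
    token_embeddings = outputs.last_hidden_state.squeeze(0)
    print(token_embeddings.shape)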

    opened by MusfiqDehan 1
  • add web demo/models/datasets to NAACL 2022 organization on Hugging Face

    Hi, congrats on the acceptance at NAACL 2022. We are having an event on Hugging Face for NAACL 2022, where you can submit Spaces (web demos), models, and datasets for papers for a chance to win prizes. The Hugging Face Hub works similarly to GitHub: you can push to user profiles or organization accounts. You can add the models/datasets and Spaces to this organization: https://huggingface.co/NAACL2022, after joining the organization using this link: https://huggingface.co/organizations/NAACL2022/share/FnuCfwNhiIRWAlngiEkLcwuUrMDMTCPbje. Let me know if you need any help with the above steps, thanks.

    Also, I see you already have models and datasets here: https://huggingface.co/csebuetnlp. It would be great to have them as a submission to the event; you can simply clone and push them to the NAACL 2022 organization.

    opened by AK391 0