This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".





We are releasing a slightly better checkpoint than the one reported in the paper: it was pretrained on 27.5 GB of data, including more code-switched and code-mixed text, and pretrained further for 2.5M steps. The pretrained model checkpoint is available here. To use this model for the downstream tasks supported in this repository, see Training & Evaluation.

Note: This model was pretrained using a specific normalization pipeline available here. All finetuning scripts in this repository use this normalization by default. If you need to adapt the pretrained model for a different task, make sure the text units are normalized with this pipeline before tokenizing to get the best results. A basic example is available at the model page.
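
As a rough illustration of the note above, the sketch below makes sure text is normalized before it reaches a tokenizer. The `normalize` import mirrors how the normalizer package is used in the code snippets elsewhere on this page. The names `fallback_normalize` and `encode` are hypothetical helpers introduced here, and the fallback (Unicode NFKC plus whitespace collapsing) is only a stand-in so the snippet runs without the package; it is not the pipeline the model was pretrained with. The whitespace tokenizer is likewise a placeholder for the real BanglaBERT tokenizer.

```python
import unicodedata
from typing import Callable, List


def fallback_normalize(text: str) -> str:
    """Stand-in only: NFKC normalization plus whitespace collapsing.

    This is NOT the BanglaBERT normalization pipeline; it exists so the
    example runs when the `normalizer` package is not installed.
    """
    return " ".join(unicodedata.normalize("NFKC", text).split())


try:
    # The normalization pipeline referenced in the note above.
    from normalizer import normalize
except ImportError:
    normalize = fallback_normalize


def encode(
    text: str,
    tokenize: Callable[[str], List[str]],
    normalize_fn: Callable[[str], str] = normalize,
) -> List[str]:
    """Normalize first, then tokenize, so the order can never be swapped."""
    return tokenize(normalize_fn(text))


# Placeholder tokenizer: plain whitespace splitting. With the real model,
# pass the tokenizer's `tokenize` method instead.
tokens = encode("আমি  বিদ্যালয়ে   যাই ।", str.split, fallback_normalize)
print(tokens)
```

Routing every input through a single helper like this keeps the normalization step from being skipped when the model is adapted to a new task.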


We are also releasing the Bangla Natural Language Inference (NLI) dataset introduced in the paper. The dataset can be found here.


To install the necessary requirements, use the following snippet:

$ git clone https://
$ cd banglabert/
$ conda create python==3.7.9 pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ bash 
  • Use the newly created environment for running the scripts in this repository.

Training & Evaluation

To use the pretrained model for finetuning or inference on different downstream tasks, see the following sections:

  • Sequence Classification.
    • For single sequence classification such as
      • Document classification
      • Sentiment classification
      • Emotion classification etc.
    • For double sequence classification such as
      • Natural Language Inference (NLI)
      • Paraphrase detection etc.
  • Token Classification.
    • For token tagging / classification tasks such as
      • Named Entity Recognition (NER)
      • Parts of Speech Tagging (PoS) etc.
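
The difference between single and double sequence classification comes down to how the input is packed. The sketch below assumes the usual BERT-style convention of `[CLS]` and `[SEP]` markers with segment ids; this README does not state how the finetuning scripts build their inputs, and in practice the tokenizer does this packing for you. The function name `pack` is hypothetical.

```python
from typing import List, Optional, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def pack(
    tokens_a: List[str], tokens_b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Pack one or two token sequences in the assumed BERT-style layout.

    Single sequence: [CLS] a [SEP]            -> segment ids all 0
    Sequence pair:   [CLS] a [SEP] b [SEP]    -> 0 for the first part, 1 for the second
    """
    tokens = [CLS] + tokens_a + [SEP]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + [SEP]
        segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Single sequence, e.g. sentiment classification.
print(pack(["ভালো", "ছবি"]))

# Sequence pair, e.g. NLI with a premise and a hypothesis.
print(pack(["সে", "হাসছে"], ["সে", "খুশি"]))
```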


Metrics                       Accuracy  F1*    Accuracy  F1 (Entity)*  Accuracy
mBERT                         83.39     56.02  98.64     67.40         75.40
XLM-R                         89.49     66.70  98.71     70.63         76.87
sagorsarker/bangla-bert-base  87.30     61.51  98.79     70.97         70.48
monsoon-nlp/bangla-electra    73.54     34.55  97.64     52.57         63.48
BanglaBERT                    92.18     74.27  99.07     72.18         82.94

* - Weighted Average
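
The footnote above says only that the starred F1 scores are weighted averages. One common reading, assumed here, is the support-weighted mean of per-class F1, where each class contributes in proportion to its number of true examples. The sketch below works through that definition on made-up labels; it is not the evaluation code used for the table, and entity-level F1 for NER is computed over entity spans, which this sketch does not cover.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by how many true examples it has.
        total += f1 * count
    return total / len(y_true)


# Made-up example: three positive and one negative gold label.
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]

# Per-class F1: pos = 0.8 (support 3), neg = 2/3 (support 1),
# so the weighted average is (0.8 * 3 + 2/3 * 1) / 4 = 23/30.
print(round(weighted_f1(y_true, y_pred), 4))
```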

The benchmarking datasets are as follows:


We would like to thank Intelligent Machines and Google TFRC Program for providing cloud support for pretraining the models.


Contents of this repository are restricted to non-commercial research purposes only under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).



If you use any of the datasets, models or code modules, please cite the following paper:

  author    = {Abhik Bhattacharjee and Tahmid Hasan and Kazi Samin and Md Saiful Islam and M. Sohel Rahman and Anindya Iqbal and Rifat Shahriyar},
  title     = {BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding},
  journal   = {CoRR},
  volume    = {abs/2101.00204},
  year      = {2021},
  url       = {},
  eprinttype = {arXiv},
  eprint    = {2101.00204}
  • BanglaBERT is not loading due to problem in config.json file

    Hi, I tried to load your BanglaBERT model from Hugging Face using the code below. The error message says there is a problem in your config.json file. Can you please fix this issue?

    !pip install git+
    from transformers import AutoModelForPreTraining, AutoTokenizer
    from normalizer import normalize
    import torch
    model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
    tokenizer_bbert = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
    text = "আমি বিদ্যালয়ে যাই ।"
    text = normalize(text)


    opened by MusfiqDehan 4
  • Error while finetuning in google colab using GPU


    I want to finetune BanglaBERT for sequence classification.

    • I get the following error when running on Google Colab with a GPU.
    • I don't get this error when running on Google Colab with a CPU.

    This error occurred while running the following command (the sequence classification example from GitHub):

    python ./sequence_classification/ --overwrite_output_dir --model_name_or_path "csebuetnlp/banglabert" --dataset_dir "./sequence_classification/sample_inputs/single_sequence/jsonl" --output_dir "./sequence_classification/outputs/" --learning_rate=2e-5 --warmup_ratio 0.1 --gradient_accumulation_steps 2 --weight_decay 0.1 --lr_scheduler_type "linear" --per_device_train_batch_size=16 --per_device_eval_batch_size=16 --max_seq_length 512 --logging_strategy "epoch" --save_strategy "epoch" --evaluation_strategy "epoch" --num_train_epochs=3 --do_train --do_eval

    Error Traceback:

    A probable solution from the PyTorch discussion forum, which I can't figure out:


    opened by fahshed 2
  • The websites the dataset was scraped from?

    As Alexa web rankings shut down in May 2022, it is no longer possible to retrieve the names of the Bangladeshi websites used.

    It would be really useful if the names of the fifty Bangladeshi websites used to scrape the dataset could be released. It would help understand the nature of the dataset used to train the model and help in model interpretability experiments too.

    opened by imr555 1
  • Can I extract word embeddings using BanglaBERT ?

    Hi, is it possible to extract or generate word embeddings using BanglaBERT? I have tokenized my Bangla sentence using BanglaBERT. Now I want to generate word embeddings from my tokenized sentence.

    !pip install transformers
    !pip install git+
    from transformers import AutoModelForPreTraining, AutoTokenizer
    from normalizer import normalize
    import torch
    model = AutoModelForPreTraining.from_pretrained("csebuetnlp/banglabert")
    tokenizer_bbert = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")
    text = 'দেশদ্রোহিতার মামলা স্বর্ণ মন্দিরের ভিতর ও বৈশাখী উৎসবের মিছিলে খলিস্তানপন্থী স্লোগান দেওয়ার জন্য কয়েকজন বিশ্ব যুবকের বিরুদ্ধে দেশদ্রোহিতার মামলা দায়ের করা হয়েছে ।'
    text = normalize(text)
    text = tokenizer_bbert.tokenize(text)
    # >>  ['দেশদ্রোহ', '##িতার', 'মামলা', 'স্বর্ণ', 'মন্দিরের', 'ভিতর', 'ও', 'বৈশাখী', 'উৎসবের', 'মিছিলে', 'খলি', '##স্তান', '##পন্থী', 'স্লোগান', 'দেওয়ার','জন্য', 'কয়েকজন', 'বিশ্ব', 'যুবকের', 'বিরুদ্ধে', 'দেশদ্রোহ', '##িতার', 'মামলা', 'দায়ের', 'করা', 'হয়েছে', '।']

    I have found out how to generate word embeddings using BERT. Will the same approach work for BanglaBERT and the Bangla language, or would it be better to use a Bangla-specific approach?

    Any kind of suggestion or advice will be helpful for me. Thanks in advance.

    opened by MusfiqDehan 1
  • add web demo/models/datasets to NAACL 2022 organization on Hugging Face

    Hi, congratulations on the acceptance at NAACL 2022. We are running an event on Hugging Face for NAACL 2022, where you can submit Spaces (web demos), models, and datasets for papers for a chance to win prizes. The Hugging Face Hub works similarly to GitHub: you can push to user profiles or organization accounts. You can add your models, datasets, and Spaces to the NAACL 2022 organization after joining it. Let me know if you need any help with these steps. Thanks.

    I also see that you already have models and datasets available. It would be great to have them as a submission to the event; you can simply clone them and push them to the NAACL 2022 organization.

    opened by AK391 0
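
On the word-embedding question raised in the issues above: the tokenizer output shown there splits some words into several wordpieces marked with `##`, so a per-word vector needs those pieces merged back together. The sketch below shows one common option, averaging the vectors of a word's pieces. The function names are hypothetical and the vectors are made-up numbers; with the real model they would presumably come from the encoder's hidden states, and any special tokens the tokenizer adds would shift the indices. Whether averaging is the best choice for Bangla is not something this page settles.

```python
from typing import List, Sequence, Tuple


def group_wordpieces(tokens: Sequence[str]) -> List[Tuple[str, List[int]]]:
    """Merge '##' continuation pieces into words, keeping their token indices."""
    words: List[Tuple[str, List[int]]] = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            word, indices = words[-1]
            words[-1] = (word + tok[2:], indices + [i])
        else:
            words.append((tok, [i]))
    return words


def mean_pool(vectors: Sequence[Sequence[float]], indices: Sequence[int]) -> List[float]:
    """Average the vectors at the given token positions, dimension by dimension."""
    dim = len(vectors[0])
    return [sum(vectors[i][d] for i in indices) / len(indices) for d in range(dim)]


# First three wordpieces from the tokenizer output quoted in the issue.
tokens = ["দেশদ্রোহ", "##িতার", "মামলা"]

# Made-up 2-dimensional vectors, one per wordpiece (illustration only).
vectors = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]]

word_vectors = [(word, mean_pool(vectors, idx)) for word, idx in group_wordpieces(tokens)]
for word, vec in word_vectors:
    print(word, vec)
```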
