TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

Overview

What is TunBERT?

People in Tunisia use the Tunisian dialect in their daily communications, in most of their media (TV, radio, songs, etc), and on the internet (social media, forums). Yet, this dialect is not standardized which means there is no unique way for writing and speaking it. Added to that, it has its proper lexicon, phonetics, and morphological structures. The need for a robust language model for the Tunisian dialect has become crucial in order to develop NLP-based applications (translation, information retrieval, sentiment analysis, etc).

BERT (Bidirectional Encoder Representations from Transformers) is a method to pre-train general purpose natural language models in an unsupervised fashion and then fine-tune them on specific downstream tasks with labelled datasets. This method was first implemented by Google and gives state-of-the-art results on many tasks as it's the first deeply bidirectional NLP pre-training system.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA).

What has been released in this repository?

This repository includes the code for fine-tuning TunBERT on the three downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA). This will help the community reproduce our work and collaborate continuously. We also released the two pre-trained new models: TunBERT Pytorch and TunBERT TensorFlow. Finally, we open source the fine-tuning datasets used for Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA)

About the Pre-trained models

TunBERT Pytorch model is based on BERT’s Pytorch implementation from NVIDIA NeMo. The model was pre-trained using 4 NVIDIA Tesla V100 GPUs on a dataset of 500k Tunisian social media comments written in Arabic letters. The pretrained model consists of 12 layers of self-attention modules. Each module is made with 12 heads of self-attention with 768 hidden-size. Furthermore, an Adam optimizer was used, with a learning rate of 1e-4, a batch size of 128, a maximum sequence length of 128 and a masking probability of 15%. Cosine annealing was used for a learning rate scheduling with a warm-up ratio of 0.01.

Similarly, a second TunBERT TensorFlow model was trained using TensorFlow implementation from Google. We use the same compute power for pre-training this model (4 NVIDIA Tesla V100 GPUs) while keeping the same hyper-parameters: A learning rate of 1e-4, a batch size of 128 and a maximum sequence length of 128.

The two models are available for download through:

For TunBERT PyTorch:

For TunBERT TensorFlow:

About the Finetuning datasets

Tunisian Sentiment Analysis

  • Tunisian Sentiment Analysis Corpus (TSAC) obtained from Facebook comments about popular TV shows. The TSAC dataset contains both Arabic and latin characters. Hence, we used only Arabic comments.

Dataset link: TSAC

Reference : Salima Medhaffar, Fethi Bougares, Yannick Estève and Lamia Hadrich-Belguith. Sentiment analysis of Tunisian dialects: Linguistic Resources and Experiments. WANLP 2017. EACL 2017

  • Tunisian Election Corpus (TEC) obtained from tweets about Tunisian elections in 2014.

Dataset link: TEC

Reference: Karim Sayadi, Marcus Liwicki, Rolf Ingold, Marc Bui. Tunisian Dialect and Modern Standard Arabic Dataset for Sentiment Analysis : Tunisian Election Context, IEEE-CICLing (Computational Linguistics and Intelligent Text Processing) Intl. conference, Konya, Turkey, 7-8 Avril 2016.

Tunisian Dialect Identification

Tunisian Arabic Dialects Identification(TADI): It is a binary classification task consisting of classifying Tunisian dialect and Non Tunisian dialect from an Arabic dialectical dataset.

Tunisian Algerian Dialect(TAD): It is a binary classification task consisting of classifying Tunisian dialect and Algerian dialect from an Arabic dialectical dataset.

The two datasets are available for download for research purposes:

TADI:

TAD:

Reading Comprehension Question-Answering

For this task, we built TRCD (Tunisian Reading Comprehension Dataset) as a Question-Answering dataset for Tunisian dialect. We used a dialectal version of the Tunisian constitution following the guideline in this article. It is composed of 144 documents where each document has exactly 3 paragraphs and three Question-Answer pairs are assigned to each paragraph. Questions were formulated by four Tunisian native speaker annotators and each question should be paired with a paragraph.

We made the dataset publicly available for research purposes:

TRCD:

Install

We use:

  • conda to setup our environment,
  • and python 3.7.9

Setup our environment:

# Clone the repo
git clone https://github.com/instadeepai/tunbert.git
cd tunbert

# Create a conda env
conda env create -f environment_torch.yml #bert-nvidia
conda env create -f environment_tf2.yml #bert-google

# Activate conda env
conda activate tunbert-torch #bert-nvidia
conda activate tf2-google #bert-google

# Install pre-commit hooks
pre-commit install

# Run all pre-commit checks (without committing anything)
pre-commit run --all-files

Project Structure

This is the folder structure of the project:

README.md             # This file :)
.gitlab-ci.yml        # CI with gitlab
.gitlab/              # Gitlab specific 
.pre-commit-config.yml  # The checks to run before every commit
environment_torch.yml       # contains the conda environment definition 
environment_tf2.yml       # contains the conda environment definition for pre-training bert-google
...

dev-data/             # data sample
    sentiment_analysis_tsac/
    dialect_classification_tadi/
    question_answering_trcd/

models/               # contains the different models to used 
    bert-google/
    bert-nvidia/

TunBERT-PyTorch

Fine-tune TunBERT-PyTorch on the Sentiment Analysis (SA) task

To fine-tune TunBERT-PyTorch on the SA task, you need to:

  • Run the following command-line:
python models/bert-nvidia/bert_finetuning_SA_DC.py --config-name "sentiment_analysis_config" model.language_model.lm_checkpoint="/path/to/checkpoints/PretrainingBERTFromText--end.ckpt" model.train_ds.file_path="/path/to/train.tsv" model.validation_ds.file_path="/path/to/valid.tsv" model.test_ds.file_path="/path/to/test.tsv"

Fine-tune TunBERT-PyTorch on the Dialect Classification (DC) task

To fine-tune TunBERT-PyTorch on the DC task, you need to:

  • Run the following command-line:
python models/bert-nvidia/bert_finetuning_SA_DC.py --config-name "dialect_classification_config" model.language_model.lm_checkpoint="/path/to/checkpoints/PretrainingBERTFromText--end.ckpt" model.train_ds.file_path="/path/to/train.tsv" model.validation_ds.file_path="/path/to/valid.tsv" model.test_ds.file_path="/path/to/test.tsv"

Fine-tune TunBERT-PyTorch on the Question Answering (QA) task

To fine-tune TunBERT-PyTorch on the QA task, you need to:

  • Run the following command-line:
python models/bert-nvidia/bert_finetuning_QA.py --config-name "question_answering_config" model.language_model.lm_checkpoint="/path/to/checkpoints/PretrainingBERTFromText--end.ckpt" model.train_ds.file="/path/to/train.json" model.validation_ds.file="/path/to/val.json" model.test_ds.file="/path/to/test.json"

TunBERT-TensorFlow

Fine-tune TunBERT-TensorFlow on the Sentiment Analysis (SA) or Dialect Classification (DC) Task:

To fine-tune TunBERT-TensorFlow for a SA task or, you need to:

  • Specify the BERT_FOLDER_NAME in models/bert-google/finetuning_sa_tdid.sh.

    BERT_FOLDER_NAME should contain the config and vocab files and the checkpoint of your language model

  • Specify the DATA_FOLDER_NAME in models/bert-google/finetuning_sa_tdid.sh

  • Run:

bash models/bert-google/finetuning_sa_tdid.sh

Fine-tune TunBERT-TensorFlow on the Question Answering (QA) Task:

To fine-tune TunBERT-TensorFlow for a QA task, you need to:

  • Specify the BERT_FOLDER_NAME in models/bert-google/finetuning_squad.sh.

    BERT_FOLDER_NAME should contain the config and vocab files and the checkpoint of your language model

  • Specify the DATA_FOLDER_NAME in models/bert-google/finetuning_squad.sh

  • Run:

bash models/bert-google/finetuning_squad.sh

You can view the results, by launching tensorboard from your logging directory.

e.g. tensorboard --logdir=OUTPUT__FOLDER_NAME

Contact information

InstaDeep

iCompass

Comments
  • The prediction function skips a row when outputting

    The prediction function skips a row when outputting "test_results.tsv" results

    After training the module using finetuning_sa_tdid.sh , then running predictions on the test.tsv, the output of the prediction which gets written in test_results.tsv I noticed that the test file and the test_results file don't have same number of rows.

    for example the tunbert/dev-data/sentiment_analysis_tsac/test.tsv file that comes with your repo has 5 sentences/rows. When I run the predictions the tunbert/finetuning_tsac/test_results.tsv file outputs only 4 sentences/rows.

    my_issue

    Update: After spending more time tinkering with the model, It truns out that I didn't spot a minor issue when I was dealing with headers. Sorry for the inconvenience and the false issue. Great work, I really enjoyed it ^^

    opened by Alaa-1 1
  • Links to pre-trained weights are broken

    Links to pre-trained weights are broken

    Hello! I'd love to use your model, but the links to the pre-trained weights (both PyTorch and TensorFlow) are showing the following error:

    <Error>
      <Code>ProjectNotFound</Code>
      <Message>The requested project was not found.</Message>
      <Details>The requested project was not found.</Details>
    </Error>
    

    If you'd like, I can help you get the weights up on the Hugging Face Hub. We have documentation on how to do so, but I'm more than happy to help 😄

    cc: @osanseviero

    opened by NimaBoscarino 0
  • Pretraining data sharing

    Pretraining data sharing

    Hello, Thank you for sharing and the models.

    I was wondering if you can share the details about the pre-training data. Is it possible de share the data for pre-training?

    Thank you in advance.

    opened by ArijRB 0
  • Number of Steps/Epochs,  and Runtime

    Number of Steps/Epochs, and Runtime

    Could you please provide me with the number of Steps/Epochs, and total run time taken to train your pytorch model? Also, do you have a published/pre-print paper by any chance? Thanks in advance.

    opened by lezakkaz 0
  • ModuleNotFoundError: No module named 'nemo'

    ModuleNotFoundError: No module named 'nemo'

    While running the command python models/bert-nvidia/bert_finetuning_SA_DC.py --config-name "sentiment_analysis_config" model.language_model.lm_checkpoint="/path/to/checkpoints/PretrainingBERTFromText--end.ckpt" model.train_ds.file_path="/path/to/train.tsv" model.validation_ds.file_path="/path/to/valid.tsv" model.test_ds.file_path="/path/to/test.tsv" to fine-tune TunBERT-PyTorch on the SA task, I had the following error image I tried to install NeMo several times, but the error still persists.

    opened by nourbouaziz98 6
Owner
InstaDeep Ltd
InstaDeep offers a host of Enterprise AI products, ranging from GPU-accelerated insights to self-learning decision making systems.
InstaDeep Ltd
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect.

null 117 Jan 7, 2023
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Benjamin Heinzerling 1.1k Jan 3, 2023
Google and Stanford University released a new pre-trained model called ELECTRA

Google and Stanford University released a new pre-trained model called ELECTRA, which has a much compact model size and relatively competitive performance compared to BERT and its variants. For further accelerating the research of the Chinese pre-trained model, the Joint Laboratory of HIT and iFLYTEK Research (HFL) has released the Chinese ELECTRA models based on the official code of ELECTRA. ELECTRA-small could reach similar or even higher scores on several NLP tasks with only 1/10 parameters compared to BERT and its variants.

Yiming Cui 1.2k Dec 30, 2022
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 8, 2023
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 1, 2022
ElasticBERT: A pre-trained model with multi-exit transformer architecture.

This repository contains finetuning code and checkpoints for ElasticBERT. Towards Efficient NLP: A Standard Evaluation and A Strong Baseli

fastNLP 48 Dec 14, 2022
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
VD-BERT: A Unified Vision and Dialog Transformer with BERT

VD-BERT: A Unified Vision and Dialog Transformer with BERT PyTorch Code for the following paper at EMNLP2020: Title: VD-BERT: A Unified Vision and Dia

Salesforce 44 Nov 1, 2022
Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Ubiquitous Knowledge Processing Lab 59 Dec 1, 2022
Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Data Augmentation using Pre-trained Transformer Models Code associated with the Data Augmentation using Pre-trained Transformer Models paper Code cont

null 44 Dec 31, 2022
Guide to using pre-trained large language models of source code

Large Models of Source Code I occasionally train and publicly release large neural language models on programs, including PolyCoder. Here, I describe

Vincent Hellendoorn 947 Dec 28, 2022
CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

Meta Research 29 Nov 30, 2022
Pre-training BERT masked language models with custom vocabulary

Pre-training BERT Masked Language Models (MLM) This repository contains the method to pre-train a BERT model using custom vocabulary. It was used to p

Stella Douka 14 Nov 2, 2022
TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

Yixuan Su 26 Oct 17, 2022
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 6, 2023
Common Voice Dataset explorer

Common Voice Dataset Explorer Common Voice Dataset is by Mozilla Made during huggingface finetuning week Usage pip install -r requirements.txt streaml

Ceyda Cinarel 22 Nov 16, 2022
Must-read papers on improving efficiency for pre-trained language models.

Must-read papers on improving efficiency for pre-trained language models.

Tobias Lee 89 Jan 3, 2023