AfriBERTa: Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

Kelechi

Last update: Nov 24, 2022

Overview

This repository contains the code for the paper "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages", which appeared at the First Workshop on Multilingual Representation Learning at EMNLP 2021.

AfriBERTa was trained on 11 languages: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá. It was evaluated on named entity recognition (NER) and text classification across 10 languages, some of which it was not pretrained on. It outperformed mBERT and XLM-R on several languages and is competitive overall.

Pretrained models

We release the following pretrained models:

AfriBERTa Small (97M params)
AfriBERTa Base (111M params)
AfriBERTa Large (126M params)
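
As a rough illustration of how one of these checkpoints might be loaded, here is a minimal sketch using the Hugging Face transformers library. The hub identifiers (`castorini/afriberta_small` and so on) are an assumption, since the text above does not say where the models are hosted, and the helper names `hub_id` and `load_afriberta` are hypothetical.

```python
"""Sketch: load a released AfriBERTa checkpoint with Hugging Face transformers."""

# ASSUMPTION: the checkpoints are published under these hub identifiers.
# The README text above does not state where they are hosted.
HUB_IDS = {
    "small": "castorini/afriberta_small",
    "base": "castorini/afriberta_base",
    "large": "castorini/afriberta_large",
}


def hub_id(size):
    """Map a size name ("small", "base", "large") to its assumed hub identifier."""
    key = size.lower()
    if key not in HUB_IDS:
        raise ValueError(f"unknown size {size!r}; expected one of {sorted(HUB_IDS)}")
    return HUB_IDS[key]


def load_afriberta(size="small"):
    """Download the tokenizer and masked-LM weights (needs network access)."""
    # Imported lazily so hub_id() works even without transformers installed.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    name = hub_id(size)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)
    return tokenizer, model
```

Calling `load_afriberta("base")` would return a `(tokenizer, model)` pair. For the downstream tasks, a task-specific class such as `AutoModelForTokenClassification` would likely replace `AutoModelForMaskedLM`.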

Reproducing Experiments

Datasets and Tokenizer

Below are details on how to obtain the datasets and trained sentencepiece tokenizer:

Language Modelling: The data for language modelling can be downloaded from this URL

NER: To obtain the NER dataset, please download it from this repository

Text Classification: To obtain the topic classification dataset, please download it from this repository

Tokenizer: The trained sentencepiece tokenizer can be downloaded from this URL
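
Once the tokenizer file has been downloaded, it can probably be loaded directly with the sentencepiece Python package. The sketch below assumes the download is a single sentencepiece model file. The path is whatever you saved it as, and `load_tokenizer` and `tokenize` are hypothetical helper names.

```python
from pathlib import Path


def load_tokenizer(model_path):
    """Load a trained sentencepiece model from a local file."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"tokenizer model not found: {path}")
    # Imported lazily so the path check runs even without sentencepiece installed.
    import sentencepiece as spm

    return spm.SentencePieceProcessor(model_file=str(path))


def tokenize(processor, text):
    """Split text into subword pieces, returned as strings."""
    return processor.encode(text, out_type=str)
```

With a downloaded model, `tokenize(load_tokenizer(path), sentence)` returns the list of subword pieces the language model would see.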

Training

To train AfriBERTa and evaluate on both downstream tasks, simply install all requirements in requirements.txt, download the relevant datasets and run the following script:

bash run_all.sh

This script will:

Train the multilingual language model from scratch and save the model as well as relevant logs
Evaluate the trained language model on NER for all ten languages over 5 seeds
Evaluate the trained language model on text classification for both languages over 5 seeds
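
The evaluation steps above amount to a grid of (language, seed) runs per task, followed by aggregation across seeds. The sketch below shows one way to organize that. It is not the contents of `run_all.sh`: the seed values 1 to 5, the language placeholders, and the function names are all assumptions made for illustration.

```python
from itertools import product
from statistics import mean, stdev


def plan_runs(languages, seeds):
    """Enumerate one fine-tuning run per (language, seed) pair."""
    return [{"language": lang, "seed": seed} for lang, seed in product(languages, seeds)]


def summarize(scores_by_language):
    """Return (mean, sample standard deviation) across seeds for each language."""
    summary = {}
    for lang, scores in scores_by_language.items():
        # stdev needs at least two values; report 0.0 for a single seed.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[lang] = (mean(scores), spread)
    return summary


# ASSUMPTION: seeds 1..5; the actual seed values used by the script may differ.
SEEDS = range(1, 6)

# Placeholder language codes, not the actual evaluation languages.
ner_runs = plan_runs(["lang_a", "lang_b"], SEEDS)
```

Reporting the mean together with the spread over seeds makes it easier to tell whether a difference between models is larger than run-to-run noise.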

Citation

@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi  and
      Zhu, Yuxin  and
      Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}


Comments

Question: How long did it take to pre-train afriberta?

The paper "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages" gives the configurations and hardware (2 x V100) used to pre-train all models, but I cannot find any details on the time needed to pre-train each model. Is there an estimate of how long it would take to train such models on similar hardware?

opened by RuanVisser
