# XLM-T: A Multilingual Language Model Toolkit for Twitter

This is the XLM-T repository, which includes data, code and pre-trained multilingual language models for Twitter.
As explained in the reference paper, we started from XLM-Roberta base and continued pre-training on a large corpus of Twitter in multiple languages. This masked language model, which we named twitter-xlm-roberta-base, is available in the Hugging Face hub.

Note: this Twitter-specific LM was pretrained following a similar strategy to its English-only counterpart, which was introduced as part of the TweetEval framework and is available here.
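As a quick sanity check of the masked LM, you can query it with a fill-mask pipeline. A minimal sketch (the example sentence is ours, not from the repository):

```python
from transformers import pipeline

# Query the multilingual Twitter masked LM released with this repository
fill_mask = pipeline("fill-mask", model="cardiffnlp/twitter-xlm-roberta-base")

# XLM-R-based models use "<mask>" as the mask token
fill_mask("I love <mask>!")
```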
We also provide task-specific models based on the Adapter technique, fine-tuned for cross-lingual sentiment analysis (see Section 2 below).
## 1 - Code
We include code with various functionalities to complement this release. We provide examples for, among others, feature extraction and adapter-based inference with language models in this notebook. We also include examples for training and evaluating language models on multiple tweet classification tasks, compatible with UMSAB (see Section 2) and TweetEval datasets.
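For reference, a minimal feature-extraction sketch with vanilla transformers (mean-pooling over the last hidden states is our illustrative choice here; the notebooks may use a different strategy):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the multilingual Twitter LM as a plain encoder
model_name = "cardiffnlp/twitter-xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a tweet and mean-pool the last hidden states into one vector
inputs = tokenizer("Today is a good day 🤗", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1).squeeze()
print(embedding.shape)  # torch.Size([768]) for the base model
```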
### Perform inference with Hugging Face pipelines

Using Hugging Face pipelines, obtaining predictions is as easy as:
```python
from transformers import pipeline

# Load the fine-tuned multilingual sentiment model and its tokenizer
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

# The model handles multilingual (even code-switched) input out of the box
sentiment_task("Huggingface es lo mejor! Awesome library 🤗😎")
```

```
[{'label': 'Positive', 'score': 0.9343640804290771}]
```
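Tweets are often normalized before inference by masking user handles and URLs, as in TweetEval. The helper below is an illustrative sketch of that convention, not code from this repository (it reuses `sentiment_task` from the snippet above):

```python
# Illustrative preprocessing sketch: mask user handles and URLs
# (assumption: mirrors the TweetEval-style normalization; adjust as needed)
def preprocess(text: str) -> str:
    tokens = []
    for t in text.split(" "):
        if t.startswith("@") and len(t) > 1:
            t = "@user"
        elif t.startswith("http"):
            t = "http"
        tokens.append(t)
    return " ".join(tokens)

sentiment_task(preprocess("@user123 this library is awesome! https://example.com 🤗"))
```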
### Fine-tune xlm-t with adapters

You can fine-tune an adapter built on top of the language model of your choice by running the src/adapter_finetuning.py script, for example:
```bash
python3 src/adapter_finetuning.py --language spanish --model cardiffnlp/twitter-xlm-roberta-base --seed 1 --lr 0.0001 --max_epochs 20
```
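If you want to sweep several languages or seeds, a small driver can shell out to the script. A hypothetical sketch reusing the flags from the example above (language and seed values are ours; adjust to your setup):

```python
import subprocess

# Hypothetical sweep over a few UMSAB languages and seeds
# (flag names taken from the example invocation above)
for lang in ["arabic", "english", "spanish"]:
    for seed in [1, 2, 3]:
        subprocess.run(
            ["python3", "src/adapter_finetuning.py",
             "--language", lang,
             "--model", "cardiffnlp/twitter-xlm-roberta-base",
             "--seed", str(seed),
             "--lr", "0.0001",
             "--max_epochs", "20"],
            check=True,
        )
```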
### Notebooks

For quick prototyping, you can directly use the Colab notebooks we provide below:
| Notebook | Description | Colab Link |
|---|---|---|
| 01: Playground examples | Minimal start examples | |
| 02: Extract embeddings | Extract embeddings from tweets | |
| 03: Sentiment prediction | Predict sentiment | |
| 04: Fine-tuning | Fine-tune a model on custom data | |
## 2 - UMSAB, the Unified Multilingual Sentiment Analysis Benchmark

As part of our framework, we also release a unified benchmark for cross-lingual sentiment analysis in eight different languages. All datasets are framed as tweet classification with three labels (positive, negative and neutral). The languages included in the benchmark, as well as the datasets they are based on, are: Arabic (SemEval-2017, Rosenthal et al. 2017), English (SemEval-2017, Rosenthal et al. 2017), French (DEFT-2017, Benamara et al. 2017), German (SB-10K, Cieliebak et al. 2017), Hindi (SAIL 2015, Patra et al. 2015), Italian (SENTIPOLC-2016, Barbieri et al. 2016), Portuguese (SentiBR, Brum and Nunes, 2017) and Spanish (InterTASS 2017, Díaz-Galiano et al. 2018). The format of each dataset follows that of TweetEval, with one tweet per line and one label per line.
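A minimal loader for this format might look as follows (the file names and directory layout are an assumption based on the TweetEval convention; check the data/sentiment folder for the actual structure):

```python
# Sketch of a loader for the one-tweet-per-line / one-label-per-line format
def load_split(text_path: str, label_path: str):
    with open(text_path, encoding="utf-8") as f:
        texts = [line.rstrip("\n") for line in f]
    with open(label_path, encoding="utf-8") as f:
        labels = [line.strip() for line in f]
    assert len(texts) == len(labels), "texts and labels must align line by line"
    return texts, labels

# Assumed paths following a TweetEval-style layout
texts, labels = load_split("data/sentiment/spanish/test_text.txt",
                           "data/sentiment/spanish/test_labels.txt")
```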
### UMSAB Results / Leaderboard
The following results (macro-F1 reported) correspond to XLM-R (Conneau et al. 2020) and XLM-Tw, the same model retrained on Twitter as explained in the reference paper. The two settings are monolingual (trained and tested on the same language) and multilingual (considering all languages for training). Check the reference paper for more details on the settings and the metrics.
Language | FT Mono | XLM-R Mono | XLM-Tw Mono | XLM-R Multi | XLM-Tw Multi |
---|---|---|---|---|---|
Arabic | 46.0 | 63.6 | 67.7 | 64.3 | 66.9 |
English | 50.9 | 68.2 | 66.9 | 68.5 | 70.6 |
French | 54.8 | 72.0 | 68.2 | 70.5 | 71.2 |
German | 59.6 | 73.6 | 76.1 | 72.8 | 77.3 |
Hindi | 37.1 | 36.6 | 40.3 | 53.4 | 56.4 |
Italian | 54.7 | 71.5 | 70.9 | 68.6 | 69.1 |
Portuguese | 55.1 | 67.1 | 76.0 | 69.8 | 75.4 |
Spanish | 50.1 | 65.9 | 68.5 | 66.0 | 67.9 |
All lang. | 51.0 | 64.8 | 66.8 | 66.8 | 69.4 |
If you would like your results added to the leaderboard, you can either submit a pull request or send an email to any of the paper authors with your results and your model's predictions. Please also include a reference to a paper describing your approach.
### Evaluating your system

To evaluate your system on macro-F1, you simply need an individual prediction file for each language. The format of the prediction files should be the same as the output examples in the predictions folder (one output label per line, in the order of the original test file), and the files should be named language.txt (e.g. arabic.txt, or all.txt if evaluating all languages at once). The predictions included as an example in this repo correspond to xlm-t trained and evaluated on all languages (All lang.).
Example usage:

```bash
python src/evaluation_script.py
```
By default, the script takes as input the test labels and the predictions from the predictions folder, but you can adjust both paths to suit your needs via the optional arguments below.
Optional arguments

Three optional arguments can be modified:

- `--gold_path`: path to the gold-standard datasets. Default: `./data/sentiment`
- `--predictions_path`: path to the predictions directory. Default: `./predictions/sentiment`
- `--language`: language to evaluate (`arabic`, `english`, ..., or `all`). Default: `all`
Evaluation script sample usage from the terminal with parameters:

```bash
python src/evaluation_script.py --gold_path ./data/sentiment --predictions_path ./predictions/sentiment --language arabic
```
(this command would output the results for the Arabic dataset only)
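For reference, the core of the evaluation is a macro-averaged F1 over the label files. A minimal sketch with scikit-learn (src/evaluation_script.py remains the authoritative implementation; the exact gold-file path is an assumption here):

```python
from sklearn.metrics import f1_score

# Compare a prediction file against gold labels, line by line
def macro_f1(gold_path: str, pred_path: str) -> float:
    with open(gold_path, encoding="utf-8") as f:
        gold = [line.strip() for line in f]
    with open(pred_path, encoding="utf-8") as f:
        pred = [line.strip() for line in f]
    return f1_score(gold, pred, average="macro")

# Assumed paths following the defaults described above
print(macro_f1("./data/sentiment/arabic/test_labels.txt",
               "./predictions/sentiment/arabic.txt"))
```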
## Reference paper

If you use this repository in your research, please use the following bib entry to cite the reference paper:
```bibtex
@inproceedings{barbieri2021xlmtwitter,
  title={{A Multilingual Language Model Toolkit for Twitter}},
  author={Barbieri, Francesco and Espinosa-Anke, Luis and Camacho-Collados, Jose},
  booktitle={arXiv preprint arXiv:2104.12250},
  year={2021}
}
```
If using UMSAB, please also cite the corresponding datasets.
## License
This repository is released open-source, but restrictions may apply to individual datasets (which are derived from existing data) or to Twitter (the main data source). We refer users to the original licenses accompanying each dataset and to Twitter regulations.