Research code for the paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models"

AdapterHub

Last update: Aug 4, 2022

Introduction

This repository contains research code for the ACL 2021 paper "How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models". Feel free to use this code to re-run our experiments or run new experiments on your own data.

Setup

General

Clone this repo

git clone git@github.com:Adapter-Hub/hgiyt.git

Install PyTorch in a new Python (>=3.6) virtual environment. We used v1.7.1; the code may not work as expected with older or newer versions.

pip install torch===1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

Initialize the submodules

git submodule update --init --recursive

Install the adapter-transformers library and dependencies

pip install lib/adapter-transformers
pip install -r requirements.txt
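
Since the code was only run with one PyTorch version, a quick sanity check of the environment can save debugging time. The following is a minimal sketch, not part of the repository: the function name `check_environment` and the constants are hypothetical, and the only facts it encodes are the Python (>=3.6) and PyTorch (v1.7.1) versions stated above.

```python
import sys

# Versions stated in the setup instructions above.
MIN_PYTHON = (3, 6)
TESTED_TORCH = "1.7.1"


def check_environment(python_version, torch_version):
    """Return a list of human-readable warnings (empty if nothing stands out)."""
    warnings = []
    major_minor = tuple(python_version[:2])
    if major_minor < MIN_PYTHON:
        warnings.append("Python >= 3.6 is required, found %d.%d" % major_minor)
    # Builds carry a local suffix such as "+cu110"; compare the base version only.
    base = torch_version.split("+")[0]
    if base != TESTED_TORCH:
        warnings.append("PyTorch %s was used for the paper, found %s" % (TESTED_TORCH, base))
    return warnings


if __name__ == "__main__":
    try:
        import torch
        found = torch.__version__
    except ImportError:
        found = "not installed"
    for message in check_environment(sys.version_info, found):
        print("WARNING:", message)
```

A warning here does not mean the code will fail, only that the setup differs from the one used for the paper.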

Pretraining

Install NVIDIA Apex for automatic mixed-precision (amp / fp16) training

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Install wiki-bert-pipeline dependencies

pip install -r lib/wiki-bert-pipeline/requirements.txt

Language-specific prerequisites

To use the Japanese monolingual model, install the morphological parser MeCab with the mecab-ipadic-20070801 dictionary:

Install gdown for easy downloads from Google Drive

pip install gdown

Download and install MeCab

gdown https://drive.google.com/uc?id=0B4y35FiV1wh7cENtOXlicTFaRUE
tar -xvzf mecab-0.996.tar.gz
cd mecab-0.996
./configure 
make
make check
sudo make install

Download and install the mecab-ipadic-20070801 dictionary

gdown https://drive.google.com/uc?id=0B4y35FiV1wh7MWVlSDBCSXZMTXM
tar -xvzf mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure --with-charset=utf8
make
sudo make install
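
To confirm that the installed dictionary really uses UTF-8, the output of `mecab -D` (dictionary info) can be inspected. The sketch below is an illustration only: the helper names are hypothetical, and it assumes that `mecab -D` prints one `key:<tab>value` line per field, including a `charset` field, with each dictionary block starting at a `filename` line.

```python
import subprocess


def parse_dictionary_info(text):
    """Parse `key:<tab>value` lines into one dict per dictionary block."""
    blocks, current = [], {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        # Assumption: a new block starts with its "filename" field.
        if key == "filename" and current:
            blocks.append(current)
            current = {}
        current[key] = value
    if current:
        blocks.append(current)
    return blocks


def dictionary_charsets(text):
    """Return the lower-cased charset of every dictionary block found."""
    return [block.get("charset", "").lower() for block in parse_dictionary_info(text)]


if __name__ == "__main__":
    try:
        result = subprocess.run(["mecab", "-D"], stdout=subprocess.PIPE,
                                universal_newlines=True)
        print(dictionary_charsets(result.stdout))
    except FileNotFoundError:
        print("mecab executable not found on PATH")
```

If the reported charset is not `utf8`, re-running `./configure --with-charset=utf8` for the dictionary and reinstalling is the first thing to try.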

Data

We unfortunately cannot host the datasets used in our paper in this repo. However, we provide download links (wherever possible) and instructions or scripts to preprocess the data for finetuning and for pretraining.

Experiments

Our scripts are largely borrowed from the transformers and adapter-transformers libraries. For pretrained models and adapters we rely on the ModelHub and AdapterHub. However, even if you haven't used them before, running our scripts should be pretty straightforward :).

We provide instructions on how to execute our finetuning scripts here and our pretraining script here.

Models

Our pretrained models are also available in the ModelHub: https://huggingface.co/hgiyt. Feel free to finetune them with our scripts or use them in your own code.
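
For use in your own code, loading one of these checkpoints could look roughly like the sketch below. It assumes the checkpoints are BERT-style masked language models that load through the standard `transformers` auto classes; the helper `load_masked_lm` is hypothetical and `MODEL_ID` is a placeholder, not a real model name, so pick an actual ID from the Hub page above.

```python
def load_masked_lm(model_id):
    """Load the tokenizer and a masked-LM model for a Hub model ID."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model


# Hypothetical placeholder; replace with an ID listed at https://huggingface.co/hgiyt
MODEL_ID = "hgiyt/<model-name>"

# Example usage (requires network access and a real model ID):
#   tokenizer, model = load_masked_lm(MODEL_ID)
#   print(tokenizer.tokenize("How good is your tokenizer?"))
```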

Citation & Authors

@inproceedings{rust-etal-2021-good,
      title     = {How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models}, 
      author    = {Phillip Rust and Jonas Pfeiffer and Ivan Vuli{\'c} and Sebastian Ruder and Iryna Gurevych},
      year      = {2021},
      booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational
                  Linguistics, {ACL} 2021, Online, August 1-6, 2021},
      url       = {https://arxiv.org/abs/2012.15613},
      pages     = {3118--3135}
}

Contact Person: Phillip Rust, [email protected]

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Comments

Version of UD-Treebanks used for Tokenizer Experiments

Hello,

I was trying to reproduce some of the experiments (related to tokenizer metrics) in the paper and I am getting slightly different values for the fertility and continuation metrics. I was wondering if I was using a different version of UD-Treebank (I am using 2.8) that might be causing this discrepancy. It would be great if you can help me with the version used for the experiments in the paper.

Thanks

opened by kabirahuja2431 4
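
For reference when comparing numbers, the two metrics mentioned in this issue can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes fertility means the average number of subwords per word and that a continued word is one split into at least two subwords. `toy_tokenize` is a hypothetical stand-in tokenizer, and differences in treebank version or preprocessing are not modelled.

```python
def tokenizer_metrics(words, tokenize):
    """Return (fertility, proportion of continued words) for a word list.

    fertility: average number of subwords produced per word.
    continued: share of words split into at least two subwords.
    """
    counts = [len(tokenize(word)) for word in words]
    counts = [c for c in counts if c > 0]  # skip words the tokenizer drops
    if not counts:
        raise ValueError("no tokenizable words")
    fertility = sum(counts) / len(counts)
    continued = sum(1 for c in counts if c >= 2) / len(counts)
    return fertility, continued


def toy_tokenize(word, size=3):
    """Hypothetical stand-in tokenizer: fixed-size chunks with '##' marks."""
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return pieces[:1] + ["##" + p for p in pieces[1:]]


if __name__ == "__main__":
    print(tokenizer_metrics(["the", "tokenizer", "is", "good"], toy_tokenize))
```

With a Hugging Face tokenizer, `tokenizer.tokenize` can be passed in place of `toy_tokenize`, applied to the word forms of the treebank.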
"--with-charset=utf8" option is needed for the Mecab install

Thanks for releasing the code. I am testing it for Japanese.

In the "Language-specific prerequisites" section of "Setup" in README.md, I think --with-charset=utf8 option is needed for ./configure in "install MeCab" and "install the mecab-ipadic-20070801 dictionary" because the default encoding is euc-jp.

Thanks in advance.

opened by tomohideshibata 2
finetuning: add missing use_fast arg to argument parser

Hi guys,

I'm currently using this great repo for re-evaluating all my models in the Turkish LM model zoo.

When using the fine-tuning script for sentiment analysis, I found that the use_fast argument is not added to the hf argument parser, which throws an exception later, when the argument is used:

https://github.com/Adapter-Hub/hgiyt/blob/9db6b6d1aab264cdf9e3131adb8157bcaf162410/finetuning/sa/run_sa.py#L141

This PR fixes it :)

opened by stefan-it 2
Character-tokenized vs subword-tokenized in Japanese

In Section A.1, the authors say that "we select the character-tokenized Japanese BERT model because it achieved considerably higher scores on preliminary NER fine-tuning evaluations.", but from my experience, a subword-tokenized model is consistently better than a character-based model.

I could reproduce the above result for NER, but the Japanese portion of WikiAnn is character-based, and when we use a subword-tokenized model, the dataset has to be converted to word-based as follows:

ja:高 B-LOC
ja:島 I-LOC
ja:市 I-LOC
ja:周 O
ja:辺 O
↓
ja:高島 B-LOC
ja:市 I-LOC
ja:周辺 O

(I will perform this conversion, and test the subword-tokenized model later.)
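
The conversion described above can be sketched as follows. This is an illustration under stated assumptions, not code from the repository: the function `chars_to_words` is hypothetical, the word segmentation is taken as given (for example from MeCab), the `ja:` prefix is assumed to be stripped beforehand, and each word simply takes the tag of its first character.

```python
def chars_to_words(char_tags, words):
    """Merge character-level BIO tags into word-level tags.

    char_tags: list of (character, tag) pairs.
    words: word segmentation whose concatenation equals the characters.
    Each word takes the tag of its first character (an assumption).
    """
    merged, pos = [], 0
    for word in words:
        span = char_tags[pos:pos + len(word)]
        if not word or "".join(ch for ch, _ in span) != word:
            raise ValueError("segmentation does not match characters at %d" % pos)
        merged.append((word, span[0][1]))
        pos += len(word)
    if pos != len(char_tags):
        raise ValueError("segmentation does not cover all characters")
    return merged


if __name__ == "__main__":
    chars = [("高", "B-LOC"), ("島", "I-LOC"), ("市", "I-LOC"), ("周", "O"), ("辺", "O")]
    print(chars_to_words(chars, ["高島", "市", "周辺"]))
```

The first-character rule loses information when an entity starts in the middle of a segmented word, so such cases would need separate handling.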

All the other datasets are word-based. I have tested the character-tokenized model cl-tohoku/bert-base-japanese-char, which is used in the paper, and subword-tokenized model cl-tohoku/bert-base-japanese (with only one seed (seed = 1)). We can see that the subword-tokenized model is consistently better than the character-based model.

| | SA | UDP (UAS/LAS) | POS |
| ---- | ---- | ---- | ---- |
| Monolingual (paper) | 88.0 | 94.7 / 93.0 | 98.1 |
| character-tokenized (mine) | 88.4 | 94.8 / 93.1 | 98.1 |
| subword-tokenized (mine) | 91.1 | 95.0 / 93.4 | 98.2 |

It would be great if you could confirm this result.

opened by tomohideshibata 1
