XLM

NEW: Added XLM-R model.

PyTorch original implementation of Cross-lingual Language Model Pretraining. Includes:

Monolingual language model pretraining (BERT)
Cross-lingual language model pretraining (XLM)
Applications: Supervised / Unsupervised MT (NMT / UNMT)
Applications: Cross-lingual text classification (XNLI)
Product-Key Memory Layers (PKM)

XLM supports multi-GPU and multi-node training, and contains code for:

Language model pretraining:
- Causal Language Model (CLM)
- Masked Language Model (MLM)
- Translation Language Model (TLM)
GLUE fine-tuning
XNLI fine-tuning
Supervised / Unsupervised MT training:
- Denoising auto-encoder
- Parallel data training
- Online back-translation

Installation

Install the python package in editable mode with

pip install -e .

Dependencies

Python 3
NumPy
PyTorch (currently tested on version 0.4 and 1.0)
fastBPE (generate and apply BPE codes)
Moses (scripts to clean and tokenize text only - no installation required)
Apex (for fp16 training)

I. Monolingual language model pretraining (BERT)

In what follows we explain how you can download and use our pretrained XLM (English-only) BERT model. Then we explain how you can train your own monolingual model, and how you can fine-tune it on the GLUE tasks.

Pretrained English model

We provide our pretrained XLM_en English model, trained with the MLM objective.

Languages	Pretraining	Model	BPE codes	Vocabulary
English	MLM	Model	BPE codes	Vocabulary

which obtains better performance than BERT (see the GLUE benchmark) while trained on the same data:

Model	Score	CoLA	SST2	MRPC	STS-B	QQP	MNLI_m	MNLI_mm	QNLI	RTE	WNLI	AX
`BERT`	80.5	60.5	94.9	89.3/85.4	87.6/86.5	72.1/89.3	86.7	85.9	92.7	70.1	65.1	39.6
`XLM_en`	82.8	62.9	95.6	90.7/87.1	88.8/88.2	73.2/89.8	89.1	88.5	94.0	76.0	71.9	44.7

If you want to play around with the model and its representations, just download the model and take a look at our ipython notebook demo.

Our XLM PyTorch English model is trained on the same data as the pretrained BERT TensorFlow model (Wikipedia + Toronto Book Corpus). Our implementation does not use the next-sentence prediction task and has only 12 layers but higher capacity (665M parameters). Overall, our model achieves better performance than the original BERT on all GLUE tasks (cf. table above for comparison).

Train your own monolingual BERT model

In what follows, we explain how you can train a similar model on your own data.

1. Preparing the data

First, get the monolingual data (English Wikipedia; the TBC corpus is no longer hosted).

# Download and tokenize Wikipedia data in 'data/wiki/en.{train,valid,test}'
# Note: the tokenization includes lower-casing and accent-removal
./get-data-wiki.sh en

Install fastBPE and learn BPE vocabulary (with 30,000 codes here):

OUTPATH=data/processed/XLM_en/30k  # path where processed files will be stored
FASTBPE=tools/fastBPE/fast  # path to the fastBPE tool

# create output path
mkdir -p $OUTPATH

# learn bpe codes on the training set (or only use a subset of it)
$FASTBPE learnbpe 30000 data/wiki/txt/en.train > $OUTPATH/codes
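For intuition about what the `learnbpe` step produces, here is a toy sketch of the BPE merge-learning loop in pure Python. It is not fastBPE's implementation: the function name `learn_bpe`, the tie-breaking rule and the tiny corpus are all hypothetical, and the real tool is much faster and handles word boundaries differently.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE learner: repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically largest pair
        # (an arbitrary choice made here only for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the pair by one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

Each learned merge plays the role of one line in the codes file; with 30,000 codes the loop would run 30,000 times over the real corpus.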

Now apply BPE tokenization to train/valid/test files:

$FASTBPE applybpe $OUTPATH/train.en data/wiki/txt/en.train $OUTPATH/codes &
$FASTBPE applybpe $OUTPATH/valid.en data/wiki/txt/en.valid $OUTPATH/codes &
$FASTBPE applybpe $OUTPATH/test.en data/wiki/txt/en.test $OUTPATH/codes &

and get the post-BPE vocabulary:

cat $OUTPATH/train.en | $FASTBPE getvocab - > $OUTPATH/vocab &

Binarize the data to limit the size of the data we load in memory:

# This will create three files: $OUTPATH/{train,valid,test}.en.pth
# After that we're all set
python preprocess.py $OUTPATH/vocab $OUTPATH/train.en &
python preprocess.py $OUTPATH/vocab $OUTPATH/valid.en &
python preprocess.py $OUTPATH/vocab $OUTPATH/test.en &

2. Train the BERT model

Train your BERT model (without the next-sentence prediction task) on the preprocessed data:


python train.py

## main parameters
--exp_name xlm_en                          # experiment name
--dump_path ./dumped                       # where to store the experiment

## data location / training objective
--data_path $OUTPATH                       # data location
--lgs 'en'                                 # considered languages
--clm_steps ''                             # CLM objective (for training GPT-2 models)
--mlm_steps 'en'                           # MLM objective

## transformer parameters
--emb_dim 2048                             # embeddings / model dimension (2048 is large; reduce it if you only have 16GB of GPU memory)
--n_layers 12                              # number of layers
--n_heads 16                               # number of heads
--dropout 0.1                              # dropout
--attention_dropout 0.1                    # attention dropout
--gelu_activation true                     # GELU instead of ReLU

## optimization
--batch_size 32                            # sequences per batch
--bptt 256                                 # sequence length (streams of 256 tokens)
--optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001  # optimizer (training is quite sensitive to this parameter)
--epoch_size 300000                        # number of sentences per epoch
--max_epoch 100000                         # max number of epochs (~infinite here)
--validation_metrics _valid_en_mlm_ppl     # validation metric (when to save the best model)
--stopping_criterion _valid_en_mlm_ppl,25  # stopping criterion (if criterion does not improve 25 times)
--fp16 true                                # use fp16 training

## bert parameters
--word_mask_keep_rand '0.8,0.1,0.1'        # bert masking probabilities
--word_pred '0.15'                         # predict 15 percent of the words

## There are other parameters that are not specified here (see train.py).
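To illustrate how `--word_pred '0.15'` and `--word_mask_keep_rand '0.8,0.1,0.1'` interact, here is a minimal sketch of BERT-style masking in pure Python. It assumes the three probabilities mean mask / keep / randomize, as the flag name suggests; the function name `mask_tokens`, the `<mask>` symbol and the toy vocabulary are hypothetical and do not come from train.py.

```python
import random

def mask_tokens(tokens, vocab, word_pred=0.15, mask_keep_rand=(0.8, 0.1, 0.1),
                mask_token="<mask>", seed=0):
    """Return (inputs, targets); targets[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        # Select roughly word_pred of the positions as prediction targets.
        if rng.random() >= word_pred:
            inputs.append(tok)
            targets.append(None)
            continue
        targets.append(tok)
        # Among the selected positions: mask, keep unchanged, or randomize.
        action = rng.choices(("mask", "keep", "rand"), weights=mask_keep_rand)[0]
        if action == "mask":
            inputs.append(mask_token)
        elif action == "keep":
            inputs.append(tok)
        else:
            inputs.append(rng.choice(vocab))
    return inputs, targets

sentence = "the cat sat on the mat".split()
inputs, targets = mask_tokens(sentence, vocab=["dog", "tree", "ran"])
```

The model is trained to recover `targets` only at the selected positions, which is why keeping or randomizing some of them stops it from relying on the mask symbol alone.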

To train with multiple GPUs use:

export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

Tips: Even when the validation perplexity plateaus, keep training your model. The larger the batch size the better (so using multiple GPUs will improve performance). Tuning the learning rate (e.g. [0.0001, 0.0002]) should help.
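Since training is sensitive to the optimizer settings, it can help to see what `adam_inverse_sqrt,lr=0.00010,warmup_updates=30000` implies for the learning rate over time. The sketch below assumes the usual inverse square root schedule (linear warmup, then decay proportional to the inverse square root of the update number); the starting value `warmup_init_lr=1e-7` and the function name are assumptions, not values taken from this README.

```python
def inverse_sqrt_lr(step, lr=1e-4, warmup_updates=30000, warmup_init_lr=1e-7):
    """Learning rate at a given update under an inverse-sqrt schedule (sketch)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to lr.
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # After warmup, decay as 1 / sqrt(step), continuous at step == warmup_updates.
    return lr * (warmup_updates / step) ** 0.5

for step in (0, 15000, 30000, 120000):
    print(step, inverse_sqrt_lr(step))
```

Under these assumptions the peak learning rate is reached at update 30,000 and has halved by update 120,000.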

3. Fine-tune a pretrained model on GLUE tasks

Now that the model is pretrained, let's finetune it. First, download and preprocess the GLUE tasks:

# Download and tokenize GLUE tasks in 'data/glue/{MNLI,QNLI,SST-2,STS-B}'

./get-data-glue.sh

# Preprocessing should be the same as for training.
# If you removed lower-casing/accent-removal, it should be reflected here as well.

and prepare the GLUE data using the codes and vocab:

# by default this script uses the BPE codes and vocab of pretrained XLM_en. Modify in script if needed.
./prepare-glue.sh

In addition to the train.py script, we provide a complementary script glue-xnli.py to fine-tune a model on either GLUE or XNLI.

You can now fine-tune the pretrained model on one of the English GLUE tasks using this config:

# Config used for fine-tuning our pretrained English BERT model (mlm_en_2048.pth)
python glue-xnli.py
--exp_name test_xlm_en_glue              # experiment name
--dump_path ./dumped                     # where to store the experiment
--model_path mlm_en_2048.pth             # model location
--data_path $OUTPATH                     # data location
--transfer_tasks MNLI-m,QNLI,SST-2       # transfer tasks (GLUE tasks)
--optimizer_e adam,lr=0.000025           # optimizer of the embedder, i.e. the pretrained model (lr \in [0.000005, 0.000025, 0.000125])
--optimizer_p adam,lr=0.000025           # optimizer of projection (lr \in [0.000005, 0.000025, 0.000125])
--finetune_layers "0:_1"                 # fine-tune all layers
--batch_size 8                           # batch size (\in [4, 8])
--n_epochs 250                           # number of epochs
--epoch_size 20000                       # number of sentences per epoch (relatively small on purpose)
--max_len 256                            # max number of words in sentences
--max_vocab -1                           # max number of words in vocab

Tips: You should sweep over the batch size (4 and 8) and the learning rate (5e-6, 2.5e-5, 1.25e-4) parameters.
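The sweep suggested above can be enumerated with a few lines of Python. This is only a sketch: it prints command lines rather than launching them, the experiment naming scheme is hypothetical, and the remaining flags from the config above (`--model_path`, `--data_path`, `--transfer_tasks`, ...) still need to be appended.

```python
import itertools

def sweep_commands(batch_sizes=(4, 8),
                   learning_rates=("0.000005", "0.000025", "0.000125")):
    """Build one glue-xnli.py command line per (batch size, learning rate) pair."""
    commands = []
    for bs, lr in itertools.product(batch_sizes, learning_rates):
        commands.append(
            "python glue-xnli.py"
            f" --exp_name glue_bs{bs}_lr{lr}"  # hypothetical naming scheme
            f" --batch_size {bs}"
            f" --optimizer_e adam,lr={lr}"
            f" --optimizer_p adam,lr={lr}"
        )
    return commands

commands = sweep_commands()
for cmd in commands:
    print(cmd)
```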

II. Cross-lingual language model pretraining (XLM)

XLM-R (new model)

XLM-R is the new state-of-the-art XLM model. XLM-R shows the possibility of training one model for many languages while not sacrificing per-language performance. It is trained on 2.5 TB of CommonCrawl data, in 100 languages. You can load XLM-R from torch.hub (PyTorch >= 1.1):

# XLM-R model
import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

Apply sentence-piece-model (SPM) encoding to input text:

en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378,  8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens) # 'مرحبا بالعالم'

zh_tokens = xlmr.encode('你好，世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好，世界'

Extract features from XLM-R:

# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract the features of all layers (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)

XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

Pretrained cross-lingual language models

We provide large pretrained models for the 15 languages of XNLI, and two other models in 17 and 100 languages.

Languages	Pretraining	Tokenization	Model	BPE codes	Vocabulary
15	MLM	tokenize + lowercase + no accent + BPE	Model	BPE codes (80k)	Vocabulary (95k)
15	MLM + TLM	tokenize + lowercase + no accent + BPE	Model	BPE codes (80k)	Vocabulary (95k)
17	MLM	tokenize + BPE	Model	BPE codes (175k)	Vocabulary (200k)
100	MLM	tokenize + BPE	Model	BPE codes (175k)	Vocabulary (200k)

These models obtain better performance than mBERT on the XNLI cross-lingual classification task:

Model	lg	en	es	de	ar	zh	ur
`mBERT`	102	81.4	74.3	70.5	62.1	63.8	58.3
`XLM (MLM)`	15	83.2	76.3	74.2	68.5	71.9	63.4
`XLM (MLM+TLM)`	15	85.0	78.9	77.8	73.1	76.5	67.3
`XLM (MLM)`	17	84.8	79.4	76.2	71.5	75.0	-
`XLM (MLM)`	100	83.7	76.6	73.6	67.4	71.7	62.9

If you want to play around with the model and its representations, just download the model and take a look at our ipython notebook demo.

The 17 and 100 Languages

The XLM-17 model includes these languages: en-fr-es-de-it-pt-nl-sv-pl-ru-ar-tr-zh-ja-ko-hi-vi

The XLM-100 model includes these languages: en-es-fr-de-zh-ru-pt-it-ar-ja-id-tr-nl-pl-simple-fa-vi-sv-ko-he-ro-no-hi-uk-cs-fi-hu-th-da-ca-el-bg-sr-ms-bn-hr-sl-zh_yue-az-sk-eo-ta-sh-lt-et-ml-la-bs-sq-arz-af-ka-mr-eu-tl-ang-gl-nn-ur-kk-be-hy-te-lv-mk-zh_classical-als-is-wuu-my-sco-mn-ceb-ast-cy-kn-br-an-gu-bar-uz-lb-ne-si-war-jv-ga-zh_min_nan-oc-ku-sw-nds-ckb-ia-yi-fy-scn-gan-tt-am

Train your own XLM model with MLM or MLM+TLM

Now in what follows, we will explain how you can train an XLM model on your own data.

1. Preparing the data

Monolingual data (MLM): Follow the same procedure as in I.1, and download multiple monolingual corpora, such as the Wikipedias.

Note that we provide a tokenizer script:

lg=en
cat my_file.$lg | ./tools/tokenize.sh $lg > my_tokenized_file.$lg &

Parallel data (TLM): We provide download scripts for some language pairs in the get-data-para.sh script.

# Download and tokenize parallel data in 'data/wiki/para/en-zh.{en,zh}.{train,valid,test}'
./get-data-para.sh en-zh &

For other language pairs, look at the OPUS collection, and modify the get-data-para.sh script [here](https://github.com/facebookresearch/XLM/blob/master/get-data-para.sh#L179-L180) to add your own language pair.

Now create your training set for the BPE vocabulary, for instance by taking 100M sentences from each monolingual corpus.

# build the training set for BPE tokenization (50k codes)
OUTPATH=data/processed/XLM_en_zh/50k
mkdir -p $OUTPATH
shuf -r -n 10000000 data/wiki/train.en >> $OUTPATH/bpe.train
shuf -r -n 10000000 data/wiki/train.zh >> $OUTPATH/bpe.train
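Note that `shuf -r` samples lines with replacement, so a corpus smaller than the requested size still contributes the full number of lines and every language gets equal weight in the BPE training set. A rough Python equivalent, assuming the corpora fit in memory (the function names and the toy data are hypothetical):

```python
import random

def sample_with_replacement(lines, n, seed=0):
    """Draw n lines uniformly at random, allowing repeats (like `shuf -r -n`)."""
    rng = random.Random(seed)
    return rng.choices(lines, k=n)

def build_bpe_train(corpora, n_per_language, seed=0):
    """Concatenate an equal-sized sample from each language's corpus."""
    out = []
    for i, (lang, lines) in enumerate(sorted(corpora.items())):
        out.extend(sample_with_replacement(lines, n_per_language, seed + i))
    return out

corpora = {"en": ["a", "b", "c"], "zh": ["x"]}
bpe_train = build_bpe_train(corpora, n_per_language=5)
```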

Then learn the 50k BPE codes on the bpe.train file, as in the previous section. Apply BPE tokenization to the monolingual and parallel corpora, and binarize everything using preprocess.py:

pair=en-zh

for lg in $(echo $pair | sed -e 's/\-/ /g'); do
  for split in train valid test; do
    $FASTBPE applybpe $OUTPATH/$pair.$lg.$split data/wiki/para/$pair.$lg.$split $OUTPATH/codes
    python preprocess.py $OUTPATH/vocab $OUTPATH/$pair.$lg.$split
  done
done

2. Train the XLM model

Train your XLM (MLM only) on the preprocessed data:

python train.py

## main parameters
--exp_name xlm_en_zh                       # experiment name
--dump_path ./dumped                       # where to store the experiment

## data location / training objective
--data_path $OUTPATH                       # data location
--lgs 'en-zh'                              # considered languages
--clm_steps ''                             # CLM objective (for training GPT-2 models)
--mlm_steps 'en,zh'                        # MLM objective

## transformer parameters
--emb_dim 1024                             # embeddings / model dimension (2048 is large; reduce it if you only have 16GB of GPU memory)
--n_layers 12                              # number of layers
--n_heads 16                               # number of heads
--dropout 0.1                              # dropout
--attention_dropout 0.1                    # attention dropout
--gelu_activation true                     # GELU instead of ReLU

## optimization
--batch_size 32                            # sequences per batch
--bptt 256                                 # sequence length (streams of 256 tokens)
--optimizer adam,lr=0.0001                 # optimizer (training is quite sensitive to this parameter)
--epoch_size 300000                        # number of sentences per epoch
--max_epoch 100000                         # max number of epochs (~infinite here)
--validation_metrics _valid_mlm_ppl        # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,25     # stopping criterion (if criterion does not improve 25 times)
--fp16 true                                # use fp16 training

## There are other parameters that are not specified here (see [here](https://github.com/facebookresearch/XLM/blob/master/train.py#L24-L198)).

Here the validation metric _valid_mlm_ppl is the average of the MLM perplexities over the languages.
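As a worked example of this metric, the sketch below assumes that each per-language perplexity is the exponential of the average per-word cross-entropy, and that the average across languages is an unweighted mean. The function names and the numbers are hypothetical.

```python
import math

def mlm_perplexity(total_loss, n_words):
    """Perplexity from a summed cross-entropy loss over n_words predicted words."""
    return math.exp(total_loss / n_words)

def average_mlm_ppl(per_lang):
    """per_lang maps a language code to (summed loss, number of predicted words)."""
    ppls = {lg: mlm_perplexity(loss, n) for lg, (loss, n) in per_lang.items()}
    return sum(ppls.values()) / len(ppls), ppls

# Toy numbers chosen so that the per-language perplexities are 8 and 16.
stats = {"en": (100 * math.log(8), 100), "zh": (50 * math.log(16), 50)}
avg_ppl, ppls = average_mlm_ppl(stats)
```

With an unweighted mean, a language with little validation data counts as much as a large one, so the metric does not favor high-resource languages.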

MLM+TLM model: If you want to add TLM on top of MLM, just add "en-zh" language pair in mlm_steps:

--mlm_steps 'en,zh,en-zh'                  # MLM objective

Tips: You can also pretrain your model with MLM-only, and then continue training with MLM+TLM with the --reload_model parameter.

3. Fine-tune XLM models (Applications, see below)

Cross-lingual language model (XLM) provides a strong pretraining method for cross-lingual understanding (XLU) tasks. In what follows, we present applications to machine translation (unsupervised and supervised) and cross-lingual classification (XNLI).

III. Applications: Supervised / Unsupervised MT

XLMs can be used as a pretraining method for unsupervised or supervised neural machine translation.

Pretrained XLM(MLM) models

The English-French, English-German and English-Romanian models are the ones we used in the paper for MT pretraining. They are trained with monolingual data only, with the MLM objective. If you use these models, you should use the same data preprocessing / BPE codes to preprocess your data. See the preprocessing commands in get-data-nmt.sh.

Languages	Pretraining	Model	BPE codes	Vocabulary
English-French	MLM	Model	BPE codes	Vocabulary
English-German	MLM	Model	BPE codes	Vocabulary
English-Romanian	MLM	Model	BPE codes	Vocabulary

Download / preprocess data

To download the data required for the unsupervised MT experiments, simply run:

git clone https://github.com/facebookresearch/XLM.git
cd XLM

And one of the three commands below:

./get-data-nmt.sh --src en --tgt fr
./get-data-nmt.sh --src de --tgt en
./get-data-nmt.sh --src en --tgt ro

for English-French, German-English, or English-Romanian experiments. The script will successively:

download Moses scripts, download and compile fastBPE
download, extract, tokenize, apply BPE to monolingual and parallel test data
binarize all datasets

If you want to use our pretrained models, you need to have an exactly identical vocabulary. Since small differences can happen during preprocessing, we recommend that you use our BPE codes and vocabulary (although you should get something almost identical if you learn the codes and compute the vocabulary yourself). This will ensure that the vocabulary of your preprocessed data perfectly matches the one of our pretrained models, and that there is not a word / index mismatch. To do so, simply run:

wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr

./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr

get-data-nmt.sh contains a few parameters defined at the beginning of the file:

N_MONO number of monolingual sentences for each language (default 5000000)
CODES number of BPE codes (default 60000)
N_THREADS number of threads in data preprocessing (default 16)

The default number of monolingual data is 5M sentences, but using more monolingual data will significantly improve the quality of pretrained models. In practice, the models we release for MT are trained on all NewsCrawl data available, i.e. about 260M, 200M and 65M sentences for German, English and French respectively.

The script should output a data summary that contains the location of all files required to start experiments:

===== Data summary
Monolingual training data:
    en: ./data/processed/en-fr/train.en.pth
    fr: ./data/processed/en-fr/train.fr.pth
Monolingual validation data:
    en: ./data/processed/en-fr/valid.en.pth
    fr: ./data/processed/en-fr/valid.fr.pth
Monolingual test data:
    en: ./data/processed/en-fr/test.en.pth
    fr: ./data/processed/en-fr/test.fr.pth
Parallel validation data:
    en: ./data/processed/en-fr/valid.en-fr.en.pth
    fr: ./data/processed/en-fr/valid.en-fr.fr.pth
Parallel test data:
    en: ./data/processed/en-fr/test.en-fr.en.pth
    fr: ./data/processed/en-fr/test.en-fr.fr.pth
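Before launching training, it can be handy to check that every file from the summary exists. The helper below is a small sketch that rebuilds the file names shown above for a language pair; the function names are hypothetical, and it assumes the directory is named after the pair in the order shown (e.g. en-fr).

```python
from pathlib import Path

def expected_files(src, tgt, root="./data/processed"):
    """List the binarized files named in the data summary for one language pair."""
    pair = f"{src}-{tgt}"
    base = Path(root) / pair
    files = []
    # Monolingual train / valid / test data for each language.
    for split in ("train", "valid", "test"):
        for lg in (src, tgt):
            files.append(base / f"{split}.{lg}.pth")
    # Parallel valid / test data for each side of the pair.
    for split in ("valid", "test"):
        for lg in (src, tgt):
            files.append(base / f"{split}.{pair}.{lg}.pth")
    return files

def missing_files(src, tgt, root="./data/processed"):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_files(src, tgt, root) if not p.exists()]

for path in missing_files("en", "fr"):
    print("missing:", path)
```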

Pretrain a language model (with MLM)

The following script will pretrain a model with the MLM objective for English and French:

python train.py

## main parameters
--exp_name test_enfr_mlm                # experiment name
--dump_path ./dumped/                   # where to store the experiment

## data location / training objective
--data_path ./data/processed/en-fr/     # data location
--lgs 'en-fr'                           # considered languages
--clm_steps ''                          # CLM objective
--mlm_steps 'en,fr'                     # MLM objective

## transformer parameters
--emb_dim 1024                          # embeddings / model dimension
--n_layers 6                            # number of layers
--n_heads 8                             # number of heads
--dropout 0.1                           # dropout
--attention_dropout 0.1                 # attention dropout
--gelu_activation true                  # GELU instead of ReLU

## optimization
--batch_size 32                         # sequences per batch
--bptt 256                              # sequence length
--optimizer adam,lr=0.0001              # optimizer
--epoch_size 200000                     # number of sentences per epoch
--validation_metrics _valid_mlm_ppl     # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,10  # end experiment if stopping criterion does not improve

If parallel data is available, the TLM objective can be used with --mlm_steps 'en-fr'. To train with both the MLM and TLM objective, you can use --mlm_steps 'en,fr,en-fr'. We provide models trained with the MLM objective for English-French, English-German and English-Romanian, along with the BPE codes and vocabulary used to preprocess the data.

Train on unsupervised MT from a pretrained model

You can now use the pretrained model for Machine Translation. To download a model trained with the command above on the MLM objective, and the corresponding BPE codes, run:

wget -c https://dl.fbaipublicfiles.com/XLM/mlm_enfr_1024.pth

If you preprocessed your dataset in ./data/processed/en-fr/ with the provided BPE codes codes_enfr and vocabulary vocab_enfr, you can pretrain your NMT model with mlm_enfr_1024.pth and run:

python train.py

## main parameters
--exp_name unsupMT_enfr                                       # experiment name
--dump_path ./dumped/                                         # where to store the experiment
--reload_model 'mlm_enfr_1024.pth,mlm_enfr_1024.pth'          # model to reload for encoder,decoder

## data location / training objective
--data_path ./data/processed/en-fr/                           # data location
--lgs 'en-fr'                                                 # considered languages
--ae_steps 'en,fr'                                            # denoising auto-encoder training steps
--bt_steps 'en-fr-en,fr-en-fr'                                # back-translation steps
--word_shuffle 3                                              # noise for auto-encoding loss
--word_dropout 0.1                                            # noise for auto-encoding loss
--word_blank 0.1                                              # noise for auto-encoding loss
--lambda_ae '0:1,100000:0.1,300000:0'                         # scheduling on the auto-encoding coefficient

## transformer parameters
--encoder_only false                                          # use a decoder for MT
--emb_dim 1024                                                # embeddings / model dimension
--n_layers 6                                                  # number of layers
--n_heads 8                                                   # number of heads
--dropout 0.1                                                 # dropout
--attention_dropout 0.1                                       # attention dropout
--gelu_activation true                                        # GELU instead of ReLU

## optimization
--tokens_per_batch 2000                                       # use batches with a fixed number of words
--batch_size 32                                               # batch size (for back-translation)
--bptt 256                                                    # sequence length
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001  # optimizer
--epoch_size 200000                                           # number of sentences per epoch
--eval_bleu true                                              # also evaluate the BLEU score
--stopping_criterion 'valid_en-fr_mt_bleu,10'                 # end experiment if stopping criterion does not improve
--validation_metrics 'valid_en-fr_mt_bleu'                    # validation metric (when to save the best model)

The parameters of your Transformer model have to be identical to the ones used for pretraining (or you will have to slightly modify the code to only reload existing parameters). After 8 epochs on 8 GPUs, the above command should give you something like this:

epoch               ->     7
valid_fr-en_mt_bleu -> 28.36
valid_en-fr_mt_bleu -> 30.50
test_fr-en_mt_bleu  -> 34.02
test_en-fr_mt_bleu  -> 36.62
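The `--lambda_ae '0:1,100000:0.1,300000:0'` value in the command above describes a schedule for the auto-encoding coefficient. The sketch below shows one way to read it, assuming the coefficient is linearly interpolated between the listed `step:value` points and stays constant after the last one; the function names are hypothetical and this is not the code used by train.py.

```python
def parse_schedule(config):
    """Parse 'step:value,step:value,...' into a sorted list of (step, value)."""
    points = []
    for item in config.split(","):
        step, value = item.split(":")
        points.append((int(step), float(value)))
    return sorted(points)

def lambda_at(step, schedule):
    """Piecewise-linear interpolation of the coefficient at a training step."""
    if step <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if step <= x1:
            return y0 + (step - x0) * (y1 - y0) / (x1 - x0)
    # Past the last point, the coefficient stays at its final value.
    return schedule[-1][1]

schedule = parse_schedule("0:1,100000:0.1,300000:0")
for step in (0, 50000, 100000, 200000, 300000, 400000):
    print(step, round(lambda_at(step, schedule), 4))
```

Read this way, the denoising auto-encoder loss dominates early training and is phased out entirely by step 300,000, leaving back-translation as the only objective.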

IV. Applications: Cross-lingual text classification (XNLI)

XLMs can be used to build cross-lingual classifiers. After fine-tuning an XLM model on an English training corpus for instance (e.g. of sentiment analysis, natural language inference), the model is still able to make accurate predictions at test time in other languages, for which there is very little or no training data. This approach is usually referred to as "zero-shot cross-lingual classification".

Get the right tokenizers

Before running the scripts below, make sure you download the tokenizers from the tools/ directory.

Download / preprocess monolingual data

Follow the same approach as in section I.1 for the 15 languages:

for lg in ar bg de el en es fr hi ru sw th tr ur vi zh; do
  ./get-data-wiki.sh $lg
done

Downloading the Wikipedia dumps may take several hours. The get-data-wiki.sh script will automatically download Wikipedia dumps, extract raw sentences, clean and tokenize them. Note that in our experiments we also concatenated the Toronto Book Corpus to the English Wikipedia, but this dataset is no longer hosted.

For Chinese and Thai you will need a special tokenizer that you can install using the commands below. For all other languages, the data will be tokenized with Moses scripts.

# Thai - https://github.com/PyThaiNLP/pythainlp
pip install pythainlp

# Chinese
cd tools/
wget https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip
unzip stanford-segmenter-2018-10-16.zip

Download parallel data

This script will download and tokenize the parallel data used for the TLM objective:

lg_pairs="ar-en bg-en de-en el-en en-es en-fr en-hi en-ru en-sw en-th en-tr en-ur en-vi en-zh"
for lg_pair in $lg_pairs; do
  ./get-data-para.sh $lg_pair
done

Apply BPE and binarize

Apply BPE and binarize the data as in section II.1.

Pretrain a language model (with MLM and TLM)

The following script will pretrain a model with the MLM and TLM objectives for the 15 XNLI languages:

python train.py

## main parameters
--exp_name train_xnli_mlm_tlm            # experiment name
--dump_path ./dumped/                    # where to store the experiment

## data location / training objective
--data_path ./data/processed/XLM15/                   # data location
--lgs 'ar-bg-de-el-en-es-fr-hi-ru-sw-th-tr-ur-vi-zh'  # considered languages
--clm_steps ''                                        # CLM objective
--mlm_steps 'ar,bg,de,el,en,es,fr,hi,ru,sw,th,tr,ur,vi,zh,en-ar,en-bg,en-de,en-el,en-es,en-fr,en-hi,en-ru,en-sw,en-th,en-tr,en-ur,en-vi,en-zh,ar-en,bg-en,de-en,el-en,es-en,fr-en,hi-en,ru-en,sw-en,th-en,tr-en,ur-en,vi-en,zh-en'  # MLM objective

## transformer parameters
--emb_dim 1024                           # embeddings / model dimension
--n_layers 12                            # number of layers
--n_heads 8                              # number of heads
--dropout 0.1                            # dropout
--attention_dropout 0.1                  # attention dropout
--gelu_activation true                   # GELU instead of ReLU

## optimization
--batch_size 32                          # sequences per batch
--bptt 256                               # sequence length
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001,weight_decay=0  # optimizer
--epoch_size 200000                      # number of sentences per epoch
--validation_metrics _valid_mlm_ppl      # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,10   # end experiment if stopping criterion does not improve
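The long `--mlm_steps` value above follows a regular pattern: one entry per language (MLM), then each English-centric pair in both directions (TLM). Rather than typing it by hand, it can be generated; the helper name `build_mlm_steps` below is hypothetical.

```python
def build_mlm_steps(langs, pivot="en"):
    """Build the --mlm_steps value: monolingual steps, then pivot pairs both ways."""
    langs = sorted(langs)
    others = [lg for lg in langs if lg != pivot]
    mono = langs                                   # MLM on each language
    forward = [f"{pivot}-{lg}" for lg in others]   # TLM, pivot first
    backward = [f"{lg}-{pivot}" for lg in others]  # TLM, pivot second
    return ",".join(mono + forward + backward)

xnli_langs = "ar bg de el en es fr hi ru sw th tr ur vi zh".split()
mlm_steps = build_mlm_steps(xnli_langs)
print(f"--mlm_steps '{mlm_steps}'")
```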

Download XNLI data

This script will download and tokenize the XNLI corpus:

./get-data-xnli.sh

Preprocess data

This script will apply BPE using the XNLI15 bpe codes, and binarize data.

./prepare-xnli.sh

Fine-tune your XLM model on cross-lingual classification (XNLI)

You can now use the pretrained model for cross-lingual classification. To download a model trained with the command above on the MLM-TLM objective, run:

wget -c https://dl.fbaipublicfiles.com/XLM/mlm_tlm_xnli15_1024.pth

You can now fine-tune the pretrained model on XNLI, or on one of the English GLUE tasks:

python glue-xnli.py
--exp_name test_xnli_mlm_tlm             # experiment name
--dump_path ./dumped/                    # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth     # model location
--data_path ./data/processed/XLM15       # data location
--transfer_tasks XNLI,SST-2              # transfer tasks (XNLI or GLUE tasks)
--optimizer_e adam,lr=0.000025           # optimizer of the embedder, i.e. the pretrained model (lr \in [0.000005, 0.000025, 0.000125])
--optimizer_p adam,lr=0.000025           # optimizer of projection (lr \in [0.000005, 0.000025, 0.000125])
--finetune_layers "0:_1"                 # fine-tune all layers
--batch_size 8                           # batch size (\in [4, 8])
--n_epochs 250                           # number of epochs
--epoch_size 20000                       # number of sentences per epoch
--max_len 256                            # max number of words in sentences
--max_vocab 95000                        # max number of words in vocab

V. Product-Key Memory Layers (PKM)

XLM also implements the Product-Key Memory layer (PKM) described in [4]. To add a memory in (for instance) the layers 4 and 7 of an encoder, you can simply provide --use_memory true --mem_enc_positions 4,7 as argument of train.py (and similarly for --mem_dec_positions and the decoder). All memory layer parameters can be found here. A minimalist and simple implementation of the PKM layer, that uses the same configuration as in the paper, can be found in this ipython notebook.
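For readers new to the idea in [4], the sketch below illustrates the key-selection step of a product-key memory in pure Python. It is not the repository's implementation (which is batched and written in PyTorch) and it omits the memory values and softmax weighting; all names and sizes are hypothetical. The query is split into two halves, each half is scored against its own small set of sub-keys, and the top-k full keys are then found among only k × k candidates instead of all full keys.

```python
import heapq
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def topk(scores, k):
    """Return the k best (score, index) pairs, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))

def product_key_topk(query, subkeys1, subkeys2, k):
    """Top-k full keys, where full key (i, j) is subkeys1[i] + subkeys2[j]."""
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    # Search each half of the query against its own sub-key set.
    best1 = topk([dot(q1, c) for c in subkeys1], k)
    best2 = topk([dot(q2, c) for c in subkeys2], k)
    # Combine into k * k candidates; a full key's score is the sum of its halves.
    n2 = len(subkeys2)
    candidates = [(s1 + s2, i * n2 + j) for s1, i in best1 for s2, j in best2]
    return heapq.nlargest(k, candidates)

def brute_force_topk(query, subkeys1, subkeys2, k):
    """Reference: score the query against every concatenated full key."""
    scores = [dot(query, c1 + c2) for c1 in subkeys1 for c2 in subkeys2]
    return topk(scores, k)

rng = random.Random(0)

def rand_vec(dim):
    # Integer entries keep the arithmetic exact, so both searches agree exactly.
    return [rng.randint(-5, 5) for _ in range(dim)]

subkeys1 = [rand_vec(4) for _ in range(16)]
subkeys2 = [rand_vec(4) for _ in range(16)]
query = rand_vec(8)

fast = product_key_topk(query, subkeys1, subkeys2, k=4)
slow = brute_force_topk(query, subkeys1, subkeys2, k=4)
```

Here the memory has 16 × 16 = 256 full keys, but the product-key search only scores 32 sub-keys and 16 candidates, which is what makes very large memories practical.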

Frequently Asked Questions

How can I run experiments on multiple GPUs?

XLM supports both multi-GPU and multi-node training, and was tested with up to 128 GPUs. To run an experiment with multiple GPUs on a single machine, simply replace python train.py in the commands above with:

export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

Multi-node training is handled automatically by SLURM.

References

Please cite [1] if you found the resources in this repository useful.

Cross-lingual Language Model Pretraining

[1] G. Lample *, A. Conneau * Cross-lingual Language Model Pretraining

* Equal contribution. Order has been determined with a coin flip.

@article{lample2019cross,
  title={Cross-lingual Language Model Pretraining},
  author={Lample, Guillaume and Conneau, Alexis},
  journal={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2019}
}

XNLI: Evaluating Cross-lingual Sentence Representations

[2] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, V. Stoyanov XNLI: Evaluating Cross-lingual Sentence Representations

@inproceedings{conneau2018xnli,
  title={XNLI: Evaluating Cross-lingual Sentence Representations},
  author={Conneau, Alexis and Lample, Guillaume and Rinott, Ruty and Williams, Adina and Bowman, Samuel R and Schwenk, Holger and Stoyanov, Veselin},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}

Phrase-Based & Neural Unsupervised Machine Translation

[3] G. Lample, M. Ott, A. Conneau, L. Denoyer, MA. Ranzato Phrase-Based & Neural Unsupervised Machine Translation

@inproceedings{lample2018phrase,
  title={Phrase-Based \& Neural Unsupervised Machine Translation},
  author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}

Large Memory Layers with Product Keys

[4] G. Lample, A. Sablayrolles, MA. Ranzato, L. Denoyer, H. Jégou Large Memory Layers with Product Keys

@article{lample2019large,
  title={Large Memory Layers with Product Keys},
  author={Lample, Guillaume and Sablayrolles, Alexandre and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
  journal={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2019}
}

Unsupervised Cross-lingual Representation Learning at Scale

[5] A. Conneau *, K. Khandelwal *, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzman, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov Unsupervised Cross-lingual Representation Learning at Scale

* Equal contribution

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

License

See the LICENSE file for more details.

I tried to run unsupervised MT on multiple GPUs with the command below:

export CUDA_VISIBLE_DEVICES=2,7; export NGPU=2; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py --exp_name unsupMT_enfr --dump_path ./dumped/ --reload_model 'mlm_enfr_1024.pth,mlm_enfr_1024.pth' --data_path ./data/processed/en-fr/ --lgs 'en-fr' --ae_steps 'en,fr' --bt_steps 'en-fr-en,fr-en-fr' --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --batch_size 32 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-fr_mt_bleu,10' --validation_metrics 'valid_en-fr_mt_bleu'

and got the error below:

TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x7f0da5e05180>, [[tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([0., 0., 0.,  ..., 0., 0., 0.], device='cuda:0'), tensor([-0.0113,  0.0161,  0.0634,  ..., -0.0057, -0.0471,  0.0002],
       device='cuda:0'), None]], [0]

Both GPUs are "GeForce GTX TITAN X" cards, and each has 12212 MB of memory.

When I use just one of these GPUs, I get the error below:

RuntimeError: CUDA out of memory. Tried to allocate 472.75 MiB (GPU 0; 11.93 GiB total capacity; 10.35 GiB already allocated; 208.81 MiB free; 897.06 MiB cached)

Is 12 GB not enough?

MLM+TLM model fails at STS-B task

The pretrained MLM+TLM model achieves < 30% Pearson correlation with human scores. With CBOW at 60% and BERT at 86%, this score seems low for the STS-B task.

I am not sure whether there is a mistake in my implementation or whether the MLM+TLM model simply does not work for the STS-B task. Can someone confirm this, @aconneau, @glample?

opened by Akella17 24
Performance of Unsupervised NMT with 5M monolingual data

Hi @glample, thank you for your nice contribution.

I noticed that the demo you released only uses 5M sentences of monolingual data. I tried it, and it does not seem to reach the accuracy reported in the paper. For reference, what accuracy should I expect with 5M monolingual data? Can you provide some help?

opened by StillKeepTry 17

Unable to run unsupervised MT on Multiple GPUs

opened by Dolprimates 13

The UNMT performance on De-En

I followed the settings you provide, but I cannot reproduce the unsupervised NMT results on De-En, while Fr-En works fine.

Does it need different parameters for De-En? Thank you very much.

opened by hpsun1109 12
How can I get the word embeddings?

Hello! Thank you for sharing this code!

Is there an easy way to get the embedding of a particular word, such as those shown in Table 5 of the paper? Thank you!

opened by stygian2a 11
[Zero-shot] Classification performance

Hi guys,

First of all, congrats on this great model! I do multilingual tweet classification, and it performs stunningly in monolingual cases. I freeze the model and use a linear layer on top of the first output token.

However, I want to do zero-shot classification on Spanish and English tweets, and I get strange results. When I train the model on Spanish tweets and evaluate on English ones, it performs pretty well. When I do the reverse (train on English tweets and evaluate on Spanish tweets), performance drops heavily (-20% F1).

Do you have an idea what the reason could be, or any general hints regarding zero-shot learning with XLM? I would appreciate any help. Thank you very much in advance!

opened by naibaf1991 10
Multi-GPU training slower than single-GPU setup

Hi,

I am not sure why, but training seems to slow down when I use 2 GPU cards.

With a single GPU, training runs at 220 sent./s.

With 2 GPUs and the same parameters, training slows down to 85 sent./s.

It also creates 2 different folders under the dumped folder.

Is this normal behaviour, or am I missing something?

Thanks

opened by hakkiyagiz 9
Translate Error
Hello, thanks for your code. I ran train.sh as written in the README. The command is:

CUDA_VISIBLE_DEVICES=0 nohup > nohup_3.log 2>&1 python3 train.py \
    --exp_name test_enfr_mlm \
    --dump_path ./dumped2/ \
    --data_path ./data/processed/en-fr/ \
    --lgs 'en-fr' \
    --clm_steps '' \
    --mlm_steps 'en,fr' \
    --emb_dim 512 \
    --n_layers 4 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --batch_size 32 \
    --bptt 256 \
    --optimizer adam,lr=0.0001 \
    --epoch_size 200000 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion _valid_mlm_ppl,3 &

Then I wanted to use the saved model (the program had not finished running; the model was saved during the run) to translate some sentences. The command is:

head -n 10 /home/qtxue/dqxu/data/para/dev/newstest2014-fren-src.fr.60000 | \
CUDA_VISIBLE_DEVICES=4 python3 translate.py --exp_name translate \
    --src_lang fr --tgt_lang en \
    --model_path /home/qtxue/best-valid_mlm_ppl.pth --output_path /home/qtxue/output.en

An error appears:

INFO - 04/25/19 10:48:43 - 0:00:00 - ============ Initialized logger ============
INFO - 04/25/19 10:48:43 - 0:00:00 - batch_size: 32
command: python translate.py --exp_name translate --src_lang fr --tgt_lang en --model_path '/home/qtxue/checkpoint.pth' --output_path '/home/qtxue/output.en' --exp_id "19njy282kc"
dump_path: ./dumped/translate/19njy282kc
exp_id: 19njy282kc
exp_name: translate
fp16: False
model_path: /home/qtxue/checkpoint.pth
output_path: /home/qtxue/output.en
src_lang: fr
tgt_lang: en
INFO - 04/25/19 10:48:43 - 0:00:00 - The experiment will be stored in ./dumped/translate/19njy282kc

INFO - 04/25/19 10:48:43 - 0:00:00 - Running command: python translate.py --exp_name translate --src_lang fr --tgt_lang en --model_path '/home/qtxue/checkpoint.pth' --output_path '/home/qtxue/output.en'

INFO - 04/25/19 10:48:48 - 0:00:05 - Supported languages: en, fr
Traceback (most recent call last):
  File "translate.py", line 150, in <module>
    main(params)
  File "translate.py", line 80, in main
    encoder.load_state_dict(reloaded['encoder'])
KeyError: 'encoder'

Is there something I need to modify, or is something wrong with how I ran it?
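The traceback shows translate.py reading reloaded['encoder'], so whatever file is passed as --model_path has to contain an 'encoder' entry. A quick way to see what a checkpoint actually holds is to list its top-level keys before translating. The sketch below uses a plain dict as a stand-in for the loaded checkpoint; missing_checkpoint_keys is a hypothetical helper, and requiring a 'decoder' entry as well is an assumption, since only 'encoder' appears in the traceback.

```python
def missing_checkpoint_keys(checkpoint, required=("encoder", "decoder")):
    """Return the required top-level keys that are absent from a checkpoint dict.

    'encoder' comes from the traceback above; 'decoder' is an assumption
    made for this sketch.
    """
    return [key for key in required if key not in checkpoint]


# Stand-in for a loaded checkpoint. With a real file one would load the dict
# first, for example with torch.load(path, map_location="cpu").
# The key names below are illustrative, not read from an actual XLM file.
checkpoint = {"model": {}, "params": {}}

missing = missing_checkpoint_keys(checkpoint)
if missing:
    print("checkpoint has keys", sorted(checkpoint), "but is missing", missing)
```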
opened by qtxue 9
training bleu is 0.0

I followed the instructions in this repo to run en-fr unsupervised MT from the pretrained MLM model, and after 24 hours of training the BLEU is 0.0. The parameters I set are: tokens per batch 200; batch size 2.

Everything else is the same as in the instructions. My training log:

fr_mt_ppl": 3493.2766576812446, "valid_en-fr_mt_acc": 4.494762971483926, "valid_en-fr_mt_bleu": 0.0, "valid_fr-en_mt_ppl": 5467.123569852876, "valid_fr-en_mt_acc": 4.613142299283623, "valid_fr-en_mt_bleu": 0.0, "test_en-fr_mt_ppl": 3884.4537842660484, "test_en-fr_mt_acc": 4.106139624415247, "test_en-fr_mt_bleu": 0.0, "test_fr-en_mt_ppl": 6325.922849001634, "test_fr-en_mt_acc": 3.9660845355606176, "test_fr-en_mt_bleu": 0.0}

I only used a single GPU with 12GB of memory. If I only have one RTX 2080 Ti GPU with 12GB of memory, how can I get good results, and how many hours do I need?

opened by klauspa 8
Replicating `XLM_en` results

Are you able to share the hyperparameters used to train the monolingual XLM_en model? I see that some of the hyperparameters are included in the pretrained model, but it has some references to reloading from a checkpoint, etc.

Thanks!

opened by bkj 8
RuntimeError: CUDA out of memory. Tried to allocate 498.50 MiB (GPU 0; 7.92 GiB total capacity; 6.74 GiB already allocated; 307.56 MiB free; 3.53 MiB cached)

Hi @glample,

I pretrained a model with the MLM objective for Mongolian and Chinese, but when I used the pretrained model for mn-zh machine translation, I got the error in the title. I tried reducing --batch_size from the default 32 to 16, 8, 4, 2, and 1, but that didn't help. Do you have any suggestions?

The pretraining results are:

INFO - 02/28/19 09:47:19 - 1 day, 1:01:20 - ============ End of epoch 7 ============
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - epoch -> 7.000000
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mn_mlm_ppl -> 20.055305
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mn_mlm_acc -> 56.151420
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_zh_mlm_ppl -> 1813.456839
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_zh_mlm_acc -> 28.312303
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mlm_ppl -> 916.756072
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mlm_acc -> 42.231861
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mn_mlm_ppl -> 8.259349
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mn_mlm_acc -> 65.375485
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_zh_mlm_ppl -> 11569.002599
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_zh_mlm_acc -> 15.452244
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mlm_ppl -> 5788.630974
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mlm_acc -> 40.413864

Command used to train on unsupervised MT from the pretrained model:

python train.py --exp_name unsupMT_mnzh --dump_path ./dumped/ --reload_model 'best-valid_mlm_ppl.pth,best-valid_mlm_ppl.pth' --data_path ./data/processed/mn-zh/ --lgs 'mn-zh' --ae_steps 'mn,zh' --bt_steps 'mn-zh-mn,zh-mn-zh' --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --batch_size 16 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.999,lr=0.0001 --epoch_size 300000 --eval_bleu true --stopping_criterion 'valid_mn-zh_mt_bleu,10' --validation_metrics 'valid_mn-zh_mt_bleu'

opened by Julisa-test 8

confusion about `lm_head`'s size?

In [58]: xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
In [59]: xlmr.model.encoder.lm_head
Out[59]:
RobertaLMHead(
  (dense): Linear(in_features=1024, out_features=1024, bias=True)
  (layer_norm): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
In [60]: xlmr.model.encoder.lm_head.weight.size()
Out[60]: torch.Size([250002, 1024])

In [61]: xlmr.model.encoder.lm_head.bias.size()
Out[61]: torch.Size([250002])

If I understand correctly, the lm_head is simply the word embedding in the tied-embedding case. What I don't understand is why the printed module shows a dense layer of size [1024, 1024], while inspecting the weight and bias shows [250002, 1024] and [250002]. I would assume [250002, 1024] is the correct one.

opened by tnq177 2

Checkpoint for TLM objective

Thanks for the great work! I was wondering whether there is a checkpoint released that was trained solely on the TLM objective, or on TLM+MLM? I am interested in using the TLM objective for my downstream application. Thanks!

opened by xu1998hz 0
[Question] Does XLM-R follow RoBERTa or XLM for MLM?

Hugging Face states that:

It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

While XLM-R paper states:

We follow the XLM approach as closely as possible, only introducing changes that improve performance at scale.

My confusion is that RoBERTa uses dynamic masking, whereas XLM uses static masking. Can somebody explain to me what exactly XLM-R does for MLM?
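For readers unfamiliar with the two terms, the difference can be sketched in a few lines. This is only an illustration of the terminology and says nothing about what XLM or XLM-R actually implement: static masking fixes the masked positions once at preprocessing time, while dynamic masking draws new positions every time a sequence is seen. The function names and the simplified rule (every selected token becomes <mask>, with no random or unchanged replacements) are assumptions made for brevity.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token by MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def static_masking(tokens, n_epochs, seed=0):
    """Mask once up front; every epoch sees the same masked sequence."""
    masked = mask_tokens(tokens, random.Random(seed))
    return [masked for _ in range(n_epochs)]


def dynamic_masking(tokens, n_epochs, seed=0):
    """Draw a fresh mask each epoch; the masked positions change over time."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(n_epochs)]


sentence = "the quick brown fox jumps over the lazy dog".split()
for epoch, masked in enumerate(dynamic_masking(sentence, n_epochs=3)):
    print(epoch, " ".join(masked))
```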

opened by mani-rai 0
How is the SentencePiece model trained in XLM-R?

I understand how a SentencePiece model is trained in the monolingual case, but the multilingual case is not clear to me. Dataset sizes vary greatly across languages, and I think this leads to a biased shared vocabulary.

Is a sampling technique also used when training SentencePiece?

If yes, how many times is sampling performed?

Isn't it better to go through all the text in the dataset to create the subword vocabulary instead of just the samples?
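As background for this question, the XLM paper [1] describes sampling sentences per language from a smoothed multinomial distribution, q_i = p_i^alpha / sum_j p_j^alpha, where p_i is the share of language i in the data. The sketch below only computes that distribution; whether and how XLM-R applies it when training SentencePiece is exactly what is being asked and is not assumed here. The function name and the sentence counts are made up for illustration.

```python
def smoothed_sampling_probs(sentence_counts, alpha):
    """Return q_i = p_i**alpha / sum_j p_j**alpha for each language.

    alpha = 1 keeps the raw proportions; smaller values of alpha shift
    probability mass toward low-resource languages.
    """
    total = sum(sentence_counts.values())
    raw = {lang: n / total for lang, n in sentence_counts.items()}
    norm = sum(p ** alpha for p in raw.values())
    return {lang: (p ** alpha) / norm for lang, p in raw.items()}


# Made-up counts: English has 9 times more sentences than Swahili.
counts = {"en": 900, "sw": 100}
for lang, q in smoothed_sampling_probs(counts, alpha=0.5).items():
    # With alpha = 0.5: en goes from 0.9 to 0.75, sw from 0.1 to 0.25.
    print(lang, round(q, 4))
```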
opened by mani-rai 0


Owner

Facebook Research
