A library for Multilingual Unsupervised or Supervised word Embeddings

Facebook Research

Last update: Jan 6, 2023

Overview

MUSE: Multilingual Unsupervised and Supervised Embeddings

MUSE is a Python library for multilingual word embeddings, whose goal is to provide the community with:

state-of-the-art multilingual word embeddings (fastText embeddings aligned in a common space)
large-scale high-quality bilingual dictionaries for training and evaluation

We include two methods, one supervised that uses a bilingual dictionary or identical character strings, and one unsupervised that does not use any parallel data (see Word Translation without Parallel Data for more details).

Dependencies

Python 2/3 with NumPy/SciPy
PyTorch
Faiss (recommended) for fast nearest neighbor search (CPU or GPU).

MUSE is available on CPU or GPU, in Python 2 or 3. Faiss is optional for GPU users - though Faiss-GPU will greatly speed up nearest neighbor search - and highly recommended for CPU users. Faiss can be installed using "conda install faiss-cpu -c pytorch" or "conda install faiss-gpu -c pytorch".

Get evaluation datasets

The following monolingual and cross-lingual word embedding evaluation datasets are available for download:

Our 110 bilingual dictionaries
28 monolingual word similarity tasks for 6 languages, and the English word analogy task
Cross-lingual word similarity tasks from SemEval2017
Sentence translation retrieval with Europarl corpora

You can simply run:

cd data/
wget https://dl.fbaipublicfiles.com/arrival/vectors.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/dictionaries.tar.gz

Alternatively, you can also download the data with:

cd data/
./get_evaluation.sh

Note: Requires bash 4. The download of Europarl is disabled by default because it is slow; you can enable it in get_evaluation.sh.

Get monolingual word embeddings

For pre-trained monolingual word embeddings, we highly recommend fastText Wikipedia embeddings, or using fastText to train your own word embeddings from your corpus.

You can download the English (en) and Spanish (es) embeddings this way:

# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec

Align monolingual word embeddings

This project includes two ways to obtain cross-lingual word embeddings:

Supervised: using a training bilingual dictionary (or identical character strings as anchor points), learn a mapping from the source to the target space using (iterative) Procrustes alignment.
Unsupervised: without any parallel data or anchor points, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement.

For more details on these approaches, see Word Translation Without Parallel Data [1].
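As a rough illustration of the Procrustes step that both approaches rely on, the sketch below solves the orthogonal Procrustes problem in closed form with NumPy. It is not MUSE's implementation: the function name `procrustes` and the row-vector convention (`X @ W` approximates `Y`) are assumptions made for this example.

```python
import numpy as np


def procrustes(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F.

    X and Y are (n_pairs, dim) arrays; row i of X is the source vector and
    row i of Y the target vector of the i-th dictionary pair.
    """
    # The SVD of the cross-covariance matrix gives the closed-form solution.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 8))
    # Build a random orthogonal "true" mapping and rotate X with it.
    Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
    W = procrustes(X, X @ Q)
    print("recovered the rotation:", np.allclose(W, Q))
```

In an iterative variant, one would presumably alternate between building a dictionary from the current mapping and re-solving this problem on the new pairs.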

The supervised way: iterative Procrustes (CPU|GPU)

To learn a mapping between the source and the target space, simply run:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

By default, dico_train will point to our ground-truth dictionaries (downloaded above); when set to "identical_char" it will use identical character strings between source and target languages to form a vocabulary. Logs and embeddings will be saved in the dumped/ directory.
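The "identical_char" idea can be pictured as a simple vocabulary intersection. The snippet below is only a sketch of that idea, not the code MUSE runs; the helper name `identical_char_pairs` and the toy vocabularies are hypothetical.

```python
def identical_char_pairs(src_words, tgt_words):
    """Return (word, word) pairs for strings spelled identically in both vocabularies."""
    tgt_set = set(tgt_words)
    # Iterate over the source list so the result order is deterministic.
    return [(w, w) for w in src_words if w in tgt_set]


if __name__ == "__main__":
    en = ["taxi", "dog", "hotel", "2017"]
    es = ["hotel", "perro", "taxi", "2017"]
    print(identical_char_pairs(en, es))
```

Such pairs (shared names, numbers, loanwords) then play the role of the training dictionary.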

The unsupervised way: adversarial training and refinement (CPU|GPU)

To learn a mapping using adversarial training and iterative Procrustes refinement, run:

python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5

By default, the validation metric is the mean cosine of word pairs from a synthetic dictionary built with CSLS (Cross-domain similarity local scaling). For some language pairs (e.g. En-Zh), we recommend centering the embeddings using --normalize_embeddings center.
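To make the CSLS criterion and the centering step more concrete, here is a small NumPy sketch. It follows the CSLS definition from the referenced paper, CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean cosine of x to its k nearest target neighbors and r_S(y) is the mean cosine of y to its k nearest source neighbors. The function names `center` and `csls_scores` are assumptions for this example and do not mirror MUSE's internal code.

```python
import numpy as np


def center(emb):
    """Subtract the mean vector from every embedding (assumed meaning of 'center')."""
    return emb - emb.mean(axis=0, keepdims=True)


def csls_scores(src, tgt, k=10):
    """Return the (n_src, n_tgt) matrix of CSLS scores between two embedding sets."""
    # Normalize rows so that dot products are cosine similarities.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T
    # Mean similarity of each source word to its k nearest target words.
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # Mean similarity of each target word to its k nearest source words.
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = center(rng.standard_normal((6, 4)))
    tgt = center(rng.standard_normal((7, 4)))
    scores = csls_scores(src, tgt, k=3)
    print(scores.shape, scores.argmax(axis=1))
```

Compared with plain nearest neighbors, the two penalty terms lower the scores of "hub" words that are close to many others.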

Evaluate monolingual or cross-lingual embeddings (CPU|GPU)

We also include a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:

Monolingual

python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000

Cross-lingual

python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000
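For readers wondering what a word translation score such as "Precision at k" measures, the following sketch computes it with cosine nearest neighbors. It is an illustration under stated assumptions, not the evaluation script itself: the function name `precision_at_k`, the index-based `gold` mapping, and the rule that a source word counts as correct when any of its gold translations is in the top k are all assumptions of this example.

```python
import numpy as np


def precision_at_k(src_emb, tgt_emb, gold, k=1):
    """Percentage of source words with a gold translation among the k nearest targets.

    gold maps a source row index to the set of acceptable target row indices.
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    hits = 0
    for i, targets in gold.items():
        sims = tgt @ src[i]                # cosine similarity to every target word
        top_k = np.argsort(-sims)[:k]      # indices of the k most similar targets
        hits += bool(targets & set(top_k.tolist()))
    return 100.0 * hits / len(gold)


if __name__ == "__main__":
    src = np.eye(3)
    tgt = np.eye(3)[[1, 0, 2]]             # targets are a permutation of the sources
    gold = {0: {1}, 1: {0}, 2: {0}}        # the last entry is deliberately wrong
    print(precision_at_k(src, tgt, gold, k=1))
```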

Word embedding format

By default, the aligned embeddings are exported to a text format at the end of experiments: --export txt. Exporting embeddings to a text file can take a while if you have a lot of embeddings. For a very fast export, you can set --export pth to export the embeddings in a PyTorch binary file, or simply disable the export (--export "").

When loading embeddings, the model can load:

PyTorch binary files previously generated by MUSE (.pth files)
fastText binary files previously generated by fastText (.bin files)
text files (text file with one word embedding per line)

The first two options are very fast and can load 1 million embeddings in a few seconds, while loading text files can take a while.
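As an illustration of the text format, the sketch below parses a file with one word embedding per line. It assumes the fastText `.vec` layout, where an optional first line holds the vocabulary size and dimension and every other line is a word followed by space-separated floats. The function name `load_text_embeddings` is hypothetical, and this is not MUSE's own loader.

```python
import io

import numpy as np


def load_text_embeddings(lines, max_vocab=None):
    """Parse 'word v1 v2 ...' lines into a word list and a float32 matrix."""
    words, vectors = [], []
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Assumption: a two-field first line is a '<count> <dim>' header.
        if i == 0 and len(parts) == 2:
            continue
        if max_vocab is not None and len(words) >= max_vocab:
            break
        words.append(parts[0])
        vectors.append(np.asarray([float(x) for x in parts[1:]], dtype=np.float32))
    return words, np.vstack(vectors)


if __name__ == "__main__":
    sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 1 2 3\n")
    words, emb = load_text_embeddings(sample)
    print(words, emb.shape, emb.dtype)
```

Parsing every float from text is what makes this path slow compared with the binary formats.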

Download

We provide multilingual embeddings and ground-truth bilingual dictionaries. These embeddings are fastText embeddings that have been aligned in a common space.

Multilingual word Embeddings

We release fastText Wikipedia supervised word embeddings for 30 languages, aligned in a single vector space.


Arabic: text	Bulgarian: text	Catalan: text	Croatian: text	Czech: text	Danish: text
Dutch: text	English: text	Estonian: text	Finnish: text	French: text	German: text
Greek: text	Hebrew: text	Hungarian: text	Indonesian: text	Italian: text	Macedonian: text
Norwegian: text	Polish: text	Portuguese: text	Romanian: text	Russian: text	Slovak: text
Slovenian: text	Spanish: text	Swedish: text	Turkish: text	Ukrainian: text	Vietnamese: text

You can visualize crosslingual nearest neighbors using demo.ipynb.

Ground-truth bilingual dictionaries

We created 110 large-scale ground-truth bilingual dictionaries using an internal translation tool. The dictionaries handle the polysemy of words well. We provide a train and test split of 5000 and 1500 unique source words, as well as a larger set of up to 100k pairs. Our goal is to ease the development and the evaluation of cross-lingual word embeddings and multilingual NLP.
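The sketch below shows one way to load such a dictionary while keeping several translations per source word, which is why the number of pairs can exceed the number of unique source words. The file layout (one whitespace-separated "source target" pair per line) is an assumption of this example, as are the function name `load_dictionary` and the toy entries.

```python
from collections import defaultdict


def load_dictionary(lines):
    """Group 'source target' pairs by source word, keeping every translation."""
    translations = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        src, tgt = parts
        translations[src].add(tgt)
    return dict(translations)


if __name__ == "__main__":
    toy = ["bat murciélago", "bat bate", "cat gato", ""]
    dico = load_dictionary(toy)
    n_pairs = sum(len(t) for t in dico.values())
    print(n_pairs, "pairs,", len(dico), "unique source words")
```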

European languages in every direction

src-tgt	German	English	Spanish	French	Italian	Portuguese
German	-	full train test	full train test	full train test	full train test	full train test
English	full train test	-	full train test	full train test	full train test	full train test
Spanish	full train test	full train test	-	full train test	full train test	full train test
French	full train test	full train test	full train test	-	full train test	full train test
Italian	full train test	full train test	full train test	full train test	-	full train test
Portuguese	full train test	full train test	full train test	full train test	full train test	-

Other languages to English (e.g. {fr,es}-en)


Afrikaans: full train test	Albanian: full train test	Arabic: full train test	Bengali: full train test
Bosnian: full train test	Bulgarian: full train test	Catalan: full train test	Chinese: full train test
Croatian: full train test	Czech: full train test	Danish: full train test	Dutch: full train test
English: full train test	Estonian: full train test	Filipino: full train test	Finnish: full train test
French: full train test	German: full train test	Greek: full train test	Hebrew: full train test
Hindi: full train test	Hungarian: full train test	Indonesian: full train test	Italian: full train test
Japanese: full train test	Korean: full train test	Latvian: full train test	Lithuanian: full train test
Macedonian: full train test	Malay: full train test	Norwegian: full train test	Persian: full train test
Polish: full train test	Portuguese: full train test	Romanian: full train test	Russian: full train test
Slovak: full train test	Slovenian: full train test	Spanish: full train test	Swedish: full train test
Tamil: full train test	Thai: full train test	Turkish: full train test	Ukrainian: full train test
Vietnamese: full train test

English to other languages (e.g. en-{fr,es})


Afrikaans: full train test	Albanian: full train test	Arabic: full train test	Bengali: full train test
Bosnian: full train test	Bulgarian: full train test	Catalan: full train test	Chinese: full train test
Croatian: full train test	Czech: full train test	Danish: full train test	Dutch: full train test
English: full train test	Estonian: full train test	Filipino: full train test	Finnish: full train test
French: full train test	German: full train test	Greek: full train test	Hebrew: full train test
Hindi: full train test	Hungarian: full train test	Indonesian: full train test	Italian: full train test
Japanese: full train test	Korean: full train test	Latvian: full train test	Lithuanian: full train test
Macedonian: full train test	Malay: full train test	Norwegian: full train test	Persian: full train test
Polish: full train test	Portuguese: full train test	Romanian: full train test	Russian: full train test
Slovak: full train test	Slovenian: full train test	Spanish: full train test	Swedish: full train test
Tamil: full train test	Thai: full train test	Turkish: full train test	Ukrainian: full train test
Vietnamese: full train test

References

Please cite [1] if you found the resources in this repository useful.

Word Translation Without Parallel Data

[1] A. Conneau*, G. Lample*, MA. Ranzato, L. Denoyer, H. Jégou, Word Translation Without Parallel Data

* Equal contribution. Order has been determined with a coin flip.

@article{conneau2017word,
  title={Word Translation Without Parallel Data},
  author={Conneau, Alexis and Lample, Guillaume and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1710.04087},
  year={2017}
}

MUSE is the project at the origin of the work on unsupervised machine translation with monolingual data only [2].

Unsupervised Machine Translation Using Monolingual Corpora Only

[2] G. Lample, A. Conneau, L. Denoyer, MA. Ranzato, Unsupervised Machine Translation Using Monolingual Corpora Only

@article{lample2017unsupervised,
  title={Unsupervised Machine Translation Using Monolingual Corpora Only},
  author={Lample, Guillaume and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1711.00043},
  year={2017}
}

Comments

Use of Validation Dictionary during Unsupervised Training

Hello - I have been training MUSE embeddings for a number of low-resource languages and I discovered that the model is being iteratively validated using an internal dictionary, even in the unsupervised case. I discovered this by coincidence when training models for Uyghur and Tigrinya, which do not have any 'pre-trained' dictionaries, and I got an error message from the evaluator, saying that it could not find the dictionary under: data/crosslingual/dictionaries/en-.5000-6500.txt

I also tried uncommenting lines 217 and 219 under src/evaluation/evaluator.py, but that gave me another error from the trainer. Could you advise on what the error means?

File "unsupervised.py", line 143, in <module>
    trainer.save_best(to_log, VALIDATION_METRIC)
File "/proj/nlpdisk3/nlpusers/noura/deep-learning/Experiments/Embeddings/MUSE/src/trainer.py", line 224, in save_best
    if to_log[metric] > self.best_valid_metric:
KeyError: 'mean_cosine-csls_knn_10-S2T-10000'

I imagine that if I created a dummy dictionary file, the same thing would happen.

Thank you, Noura

opened by narnoura 16

ValueError: could not convert string to float: 'encoding="utf-8"?>'

Hi, I am a beginner with MUSE. I tried to run unsupervised training using Japanese and English pre-trained word vectors. For Japanese, I cleaned a collection of Japanese text with MeCab and embedded it with fastText (300d). For English, I took the pre-trained word vectors crawl-300d-2M.vec.zip (2 million word vectors trained on Common Crawl, 600B tokens) from fastText.

Here is the command I used to train the model in a GPU environment:

CUDA_VISIBLE_DEVICES=1,2 python unsupervised.py --src_lang ja --tgt_lang en --src_emb /item_embdd/skipgram/allgenre_model.vec --tgt_emb /pretrained_vec/en/crawl-300d-2M.vec 2> error20190214a.txt

I got the error messages below:

Traceback (most recent call last):
  File "unsupervised.py", line 139, in <module>
    evaluator.all_eval(to_log)
  File "/multi_embedd/MUSE/src/evaluation/evaluator.py", line 215, in all_eval
    self.monolingual_wordsim(to_log)
  File "/multi_embedd/MUSE/src/evaluation/evaluator.py", line 49, in monolingual_wordsim
    ) if self.params.tgt_lang else None
  File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 105, in get_wordsim_scores
    coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
  File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 69, in get_spearman_rho
    word_pairs = get_word_pairs(path)
  File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 39, in get_word_pairs
    word_pairs.append((line[0], line[1], float(line[2])))
ValueError: could not convert string to float: 'encoding="utf-8"?>'

Could anyone give me advice or comments? Thanks in advance.

opened by learnercat 15

Average time to align monolingual word embeddings: the supervised way?

I am aligning English and Hindi fastText monolingual embeddings using the supervised way on a GPU. Are there any time estimates for how long it takes? It's been 4 hours, and it is still in the first refinement step.

I ran the following command:

python supervised.py --src_lang en --tgt_lang hi --src_emb wiki.en.vec --tgt_emb wiki.hi.vec --n_iter 5 --dico_train default

Update: it was running for close to 20 hours on a GeForce GTX 1080, constantly hogging 1 CPU core, but no entries were added to the log. I am running it again.

Log:

INFO - 12/27/17 17:57:14 - 0:00:00 - ============ Initialized logger ============
INFO - 12/27/17 17:57:14 - 0:00:00 - cuda: True
                                     dico_build: S2T&T2S
                                     dico_max_rank: 10000
                                     dico_max_size: 0
                                     dico_method: csls_knn_10
                                     dico_min_size: 0
                                     dico_threshold: 0
                                     dico_train: default
                                     emb_dim: 300
                                     exp_path: /MUSE/dumped/hidden
                                     export: True
                                     max_vocab: 200000
                                     n_iters: 5
                                     normalize_embeddings: 
                                     seed: -1
                                     src_emb:wiki.en.vec
                                     src_lang: en
                                     tgt_emb: wiki.hi.vec
                                     tgt_lang: hi
                                     verbose: 2
INFO - 12/27/17 17:57:14 - 0:00:00 - The experiment will be stored in hidden/MUSE/dumped/hidden
INFO - 12/27/17 17:57:25 - 0:00:11 - Loaded 200000 pre-trained word embeddings
INFO - 12/27/17 17:57:45 - 0:00:31 - Loaded 158016 pre-trained word embeddings
INFO - 12/27/17 17:57:49 - 0:00:34 - Found 8704 pairs of words in the dictionary (4998 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 12/27/17 17:57:49 - 0:00:34 - Starting refinement iteration 0...
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 -                        Dataset      Found     Not found          Rho
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MTurk-771        771             0       0.6689
INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MTurk-287        286             1       0.6773
INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_SIMLEX-999        998             1       0.3823
INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-REL        252             0       0.6820
INFO - 12/27/17 17:57:49 - 0:00:35 -                 EN_RW-STANFORD       1323           711       0.5080
INFO - 12/27/17 17:57:49 - 0:00:35 -                       EN_MC-30         30             0       0.8123
INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-ALL        353             0       0.7388
INFO - 12/27/17 17:57:49 - 0:00:35 -                    EN_VERB-143        144             0       0.3973
INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MEN-TR-3k       3000             0       0.7637
INFO - 12/27/17 17:57:49 - 0:00:35 -                      EN_YP-130        130             0       0.5333
INFO - 12/27/17 17:57:49 - 0:00:35 -                       EN_RG-65         65             0       0.7974
INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_SEMEVAL17        379             9       0.7216
INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-SIM        203             0       0.7811
INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
INFO - 12/27/17 17:57:49 - 0:00:35 - Monolingual source word similarity score average: 0.65108
INFO - 12/27/17 17:57:49 - 0:00:35 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 12/27/17 17:57:50 - 0:00:36 - 1500 source words - nn - Precision at k = 1: 23.800000
INFO - 12/27/17 17:57:51 - 0:00:36 - 1500 source words - nn - Precision at k = 5: 41.133333
INFO - 12/27/17 17:57:51 - 0:00:37 - 1500 source words - nn - Precision at k = 10: 48.133333
INFO - 12/27/17 17:57:51 - 0:00:37 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)

opened by gvishal 12

Hindi not visible

In the "Other languages to English" and "English to other languages" dictionaries for Hindi, I am seeing some strange characters in place of Hindi words, whereas English is displayed just fine. Is it some fault on my side?

opened by hiteshn97 11

When I tried supervised.py, I got "ValueError: The input must have at least 3 entries!"

Hello, I used Docker to build an environment containing conda, PyTorch and Faiss with Python 3.6, and it can finally run this amazing open-source project.

But when I tried the following command to test the supervised method:

python3 supervised.py --src_lang en --tgt_lang es --src_emb ../wiki.en.vec --tgt_emb ../Spanish_wiki.es.vec --n_iter 5 --dico_train identical_char --cuda False

it returned a ValueError saying "The input must have at least 3 entries!".

Here are the logs:

Failed to load GPU Faiss: No module named 'swigfaiss_gpu'
Faiss falling back to CPU-only.
Impossible to import Faiss-GPU. Switching to FAISS-CPU, this will be slower.

INFO - 01/06/18 12:20:38 - 0:00:00 - ============ Initialized logger ============
INFO - 01/06/18 12:20:38 - 0:00:00 - cuda: False
                                     dico_build: S2T&T2S
                                     dico_max_rank: 10000
                                     dico_max_size: 0
                                     dico_method: csls_knn_10
                                     dico_min_size: 0
                                     dico_threshold: 0
                                     dico_train: identical_char
                                     emb_dim: 300
                                     exp_path: /Documents/MUSE-master/dumped/nrthsd26ay
                                     export: True
                                     max_vocab: 200000
                                     n_iters: 5
                                     normalize_embeddings: 
                                     seed: -1
                                     src_emb: ../wiki.en.vec
                                     src_lang: en
                                     tgt_emb: ../Spanish_wiki.es.vec
                                     tgt_lang: es
                                     verbose: 2
INFO - 01/06/18 12:20:38 - 0:00:00 - The experiment will be stored in /Documents/MUSE-master/dumped/nrthsd26ay
INFO - 01/06/18 12:20:48 - 0:00:10 - Loaded 200000 pre-trained word embeddings
INFO - 01/06/18 12:21:02 - 0:00:24 - Loaded 200000 pre-trained word embeddings
INFO - 01/06/18 12:21:04 - 0:00:26 - Found 85912 pairs of identical character strings.
INFO - 01/06/18 12:21:05 - 0:00:26 - Starting refinement iteration 0...
INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
INFO - 01/06/18 12:22:14 - 0:01:36 - Dataset Found Not found Rho
INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
here: n--> 771 m--> 771
Traceback (most recent call last):
  File "supervised.py", line 92, in <module>
    evaluator.all_eval(to_log)
  File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 188, in all_eval
    self.monolingual_wordsim(to_log)
  File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 43, in monolingual_wordsim
    self.mapping(self.src_emb.weight).data.cpu().numpy()
  File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 104, in get_wordsim_scores
    coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
  File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 83, in get_spearman_rho
    return spearmanr(gold, pred).correlation, len(gold), not_found
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/stats.py", line 3301, in spearmanr
    rho, pval = mstats_basic.spearmanr(a, b, axis)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/mstats_basic.py", line 461, in spearmanr
    raise ValueError("The input must have at least 3 entries!")
ValueError: The input must have at least 3 entries!

Does anyone have an idea about this problem? Thanks ^^.

opened by miscy210 11

Results obtained are different from those published in the paper

My settings are below; the remaining parameters were left at their defaults. The word vectors and the zh-en dictionary were downloaded from the official site.

export SRC_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.zh.vec 
export TGT_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.en.vec
nohup python unsupervised.py --src_lang zh --tgt_lang en --src_emb $SRC_EMB --tgt_emb $TGT_EMB --cuda 1 --export 1 --exp_path ./dumped/unsuperv/zh-mn --emb_dim 300 --refinement true --adversarial true > zh-en-unsuper.log &

But the results are all zeros:

east one unknown word (0 in lang1, 0 in lang2)
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 1: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 5: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 10: 0.000000
INFO - 04/24/18 10:03:34 - 0:08:39 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 10: 0.000000

opened by yudianer 9

Using --dico_train identical_char still needs dictionaries

According to the docs:

when set to "identical_char" it will use identical character strings between source and target languages to form a vocabulary.

I understood that the dictionary was going to be created using the given corpus.

opened by DavidGOrtega 9

Reproducing the EN-ZH results in Table 1

Hi,

I tried training MUSE in the unsupervised way with the pretrained fastText Wikipedia embeddings. On some European language pairs, such as EN-DE or EN-ES, I was able to get reasonable performance using the default parameters. However, for EN-ZH or ZH-EN, using the default parameters, the cross-lingual word similarity scores are always 0 (even for top 10).

As a comparison, to rule out problems with the data, I ran the supervised setting for EN-ZH, and it gave non-zero performance (though the number is a few points lower than that in the paper).

Any idea of what I might have done wrong? Thank you.

opened by ccsasuke 9

Understanding the output of training

After training, in dumped/debug/xohu3xpdfn I get the following files (trained for English-Hindi):

best_mapping.pth

vectors-en.txt

vectors-hi.txt

params.pkl

train.log

Do the .txt files contain the mapped vectors? If not, how can I obtain the mapping?

opened by euler16 8

How can I do a translation task?

This might be dumb. I read the paper and the git repo. Could you briefly tell me, at a high level, how I can do a translation task given src_embedding and target_embeddings?

I understand I can do src_word -> src_embedding -> matrix transform to target_embedding. Then how do I retrieve the target_word?

thanks!

opened by ecilay 8

2909: RuntimeWarning: Mean of empty slice.

I tried unsupervised.py with wiki.en.vec and wiki.ja.vec, pretrained with fastText:

python unsupervised.py --src_lang en --tgt_lang ja --src_emb wiki.en.vec --tgt_emb wiki.ja.vec --n_refinement 5 --cuda 1 --exp_path vec --dico_eval en-ja.5000-6500.txt --normalize_embeddings center

But I get some warnings and precision close to 0:

INFO - 06/06/18 17:59:04 - 0:10:32 - 996000 - Discriminator loss: 0.3396 - 3339 samples/s
INFO - 06/06/18 17:59:07 - 0:10:35 - ====================================================================
INFO - 06/06/18 17:59:07 - 0:10:35 - Dataset Found Not found Rho
INFO - 06/06/18 17:59:07 - 0:10:35 - ====================================================================
INFO - 06/06/18 17:59:07 - 0:10:35 - ====================================================================
/***.pyenv/versions/3.6.1/lib/python3.6/site-packages/numpy/core/fromnumeric.py:2909: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/***/.pyenv/versions/3.6.1/lib/python3.6/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
INFO - 06/06/18 17:59:07 - 0:10:35 - Monolingual source word similarity score average: nan
INFO - 06/06/18 17:59:07 - 0:10:35 - Found 1799 pairs of words in the dictionary (1459 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 06/06/18 17:59:07 - 0:10:35 - 1459 source words - nn - Precision at k = 1: 0.000000
INFO - 06/06/18 17:59:08 - 0:10:35 - 1459 source words - nn - Precision at k = 5: 0.000000
INFO - 06/06/18 17:59:08 - 0:10:35 - 1459 source words - nn - Precision at k = 10: 0.000000
INFO - 06/06/18 17:59:08 - 0:10:35 - Found 1799 pairs of words in the dictionary (1459 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
INFO - 06/06/18 17:59:38 - 0:11:06 - 1459 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 06/06/18 17:59:38 - 0:11:06 - 1459 source words - csls_knn_10 - Precision at k = 5: 0.068540
INFO - 06/06/18 17:59:39 - 0:11:06 - 1459 source words - csls_knn_10 - Precision at k = 10: 0.068540

opened by ghost 8

Self-mapped English words in dictionaries

Hi, I've checked en-ms, en-id, en-ja, etc. files, and it seems that in many cases the English word has been mapped to itself and there is no translation. Is there any reason for that? To use the dictionaries, can we simply discard them?

opened by Sara-Rajaee 0

Tried on GloVe?

I have managed to replicate the results in the paper for English and German FastText. For reference, I am interested in the cross-lingual word similarity task. The results I got (reporting Spearman correlation) are:

Original FastText: 9%
Mapped FastText: 71%

However, I tried the same code on English and German GloVe embeddings and did not get much improvement. Results:

Original GloVe: 1%
Mapped GloVe: 3%

Any idea why this might be the case?

opened by Wafaa014 0

Added a 'node' script to compress models up to 1/10

Why: reading model data at rest and transferring it over the network is tedious.

I used a simple Node.js script relying on protobufjs to compress the "Multilingual word Embeddings" models. If anyone is good at packaging, it would be better to rely on a single script rather than an npm/yarn package, perhaps in Python, since the whole project is in Python.
CLA Signed

opened by bacloud23 6

[ML Question] Is it possible somehow to translate two or three words?

Can anyone please tell me whether translation of very short sentences could be achieved somehow? I do not mean translating the exact sentence; I think that would be out of scope because sentences contain tense, propositions and so on, which are very ambiguous.

What I mean is: is it possible to translate, for instance, "bat", given that "bat" in "baseball bat" is so different from "bat wings"? In this situation the context could probably help, but I'm not sure how, or even whether this is possible using these models.

Any help or hint on how to achieve this would be really appreciated. Thanks

opened by bacloud23 0

Reduce memory usage on loading embedding from txt

The original implementation of read_txt_embeddings takes a lot of memory. For example, to load an embedding txt file with a vocabulary of 2,000,000 words and 300 embedding dimensions, the vectors list takes 64 × 300 × 2,000,000 bits = 4.8 GB, np.concatenate takes 4.8 GB, and torch.from_numpy takes 2.4 GB, for a total of around 12 GB. By knowing vocab_size in advance and setting the dtype of the vectors to np.float32, the memory requirement can be reduced to around 2.4 GB instead of 12 GB.
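The idea can be sketched as follows. This is not the code of the pull request: the names `embedding_bytes` and `read_txt_embeddings_prealloc` are hypothetical, and the reader assumes a fastText-style header line giving the vocabulary size and dimension.

```python
import io

import numpy as np


def embedding_bytes(vocab_size, dim, dtype):
    """Size in bytes of a dense (vocab_size, dim) matrix of the given dtype."""
    return vocab_size * dim * np.dtype(dtype).itemsize


def read_txt_embeddings_prealloc(lines):
    """Fill a preallocated float32 matrix instead of growing a list of vectors."""
    lines = iter(lines)
    # Assumption: the first line is '<vocab_size> <dim>'.
    vocab_size, dim = map(int, next(lines).split())
    vectors = np.empty((vocab_size, dim), dtype=np.float32)
    words = []
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        vectors[i] = [float(x) for x in parts[1:]]  # written in place, row by row
    # Trim in case the file holds fewer rows than the header announced.
    return words, vectors[: len(words)]


if __name__ == "__main__":
    print(embedding_bytes(2_000_000, 300, np.float64) / 1e9, "GB as float64")
    print(embedding_bytes(2_000_000, 300, np.float32) / 1e9, "GB as float32")
    words, vecs = read_txt_embeddings_prealloc(io.StringIO("2 3\na 1 2 3\nb 4 5 6\n"))
    print(words, vecs.dtype, vecs.shape)
```

Because the matrix is allocated once at its final size and dtype, no intermediate float64 list or concatenated copy is kept in memory.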
CLA Signed

opened by yeyinthtoon 2

