A library for Multilingual Unsupervised or Supervised word Embeddings

Overview

MUSE: Multilingual Unsupervised and Supervised Embeddings

[Model overview figure]

MUSE is a Python library for multilingual word embeddings, whose goal is to provide the community with:

  • state-of-the-art multilingual word embeddings (fastText embeddings aligned in a common space)
  • large-scale high-quality bilingual dictionaries for training and evaluation

We include two methods, one supervised that uses a bilingual dictionary or identical character strings, and one unsupervised that does not use any parallel data (see Word Translation without Parallel Data for more details).

Dependencies

MUSE is available on CPU or GPU, in Python 2 or 3. Faiss is optional for GPU users - though Faiss-GPU will greatly speed up nearest neighbor search - and highly recommended for CPU users. Faiss can be installed using "conda install faiss-cpu -c pytorch" or "conda install faiss-gpu -c pytorch".
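
Faiss is only used to speed up nearest-neighbor search, so a minimal fallback pattern looks like the sketch below (illustrative only, not MUSE's internal code; the function name is made up):

import numpy as np

try:
    import faiss  # installed via: conda install faiss-cpu -c pytorch (or faiss-gpu)
    HAS_FAISS = True
except ImportError:
    HAS_FAISS = False

def nearest_neighbors(queries, database, k=10):
    # Return the indices of the k nearest neighbors by inner product.
    queries = np.ascontiguousarray(queries, dtype=np.float32)
    database = np.ascontiguousarray(database, dtype=np.float32)
    if HAS_FAISS:
        index = faiss.IndexFlatIP(database.shape[1])
        index.add(database)
        _, idx = index.search(queries, k)
        return idx
    # Exact NumPy fallback: fine for small vocabularies, slow for large ones.
    scores = queries @ database.T
    return np.argsort(-scores, axis=1)[:, :k]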

Get evaluation datasets

To download monolingual and cross-lingual word embeddings evaluation datasets:

  • Our 110 bilingual dictionaries
  • 28 monolingual word similarity tasks for 6 languages, and the English word analogy task
  • Cross-lingual word similarity tasks from SemEval2017
  • Sentence translation retrieval with Europarl corpora

You can simply run:

cd data/
wget https://dl.fbaipublicfiles.com/arrival/vectors.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/dictionaries.tar.gz

Alternatively, you can also download the data with:

cd data/
./get_evaluation.sh

Note: this requires bash 4. The Europarl download is disabled by default (it is slow); you can enable it in get_evaluation.sh.

Get monolingual word embeddings

For pre-trained monolingual word embeddings, we highly recommend fastText Wikipedia embeddings, or using fastText to train your own word embeddings from your corpus.

You can download the English (en) and Spanish (es) embeddings this way:

# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec

Align monolingual word embeddings

This project includes two ways to obtain cross-lingual word embeddings:

  • Supervised: using a training bilingual dictionary (or identical character strings as anchor points), learn a mapping from the source to the target space using (iterative) Procrustes alignment.
  • Unsupervised: without any parallel data or anchor point, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement.

For more details on these approaches, please see Word Translation Without Parallel Data [1].
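
As a rough illustration of the Procrustes step (a sketch assuming row-wise embedding matrices, not the library's actual code): given paired source/target vectors X and Y from a seed dictionary, the best orthogonal mapping W minimizing ||X W^T - Y|| comes from an SVD.

import numpy as np

def procrustes(X, Y):
    # X, Y: (n_pairs, dim) source/target vectors of dictionary pairs (rows aligned).
    # Returns the orthogonal matrix W such that X @ W.T best approximates Y.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Iterative refinement (sketch): re-build a dictionary from the current mapping
# (e.g. with CSLS) and re-solve Procrustes for a few rounds.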

The supervised way: iterative Procrustes (CPU|GPU)

To learn a mapping between the source and the target space, simply run:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

By default, --dico_train points to our ground-truth dictionaries (downloaded above); when set to "identical_char", it will use identical character strings shared by the source and target languages to form the training vocabulary. Logs and embeddings will be saved in the dumped/ directory.
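
As a toy illustration of the "identical_char" idea (a sketch, not the library's implementation): words spelled identically in both vocabularies act as free seed anchor pairs.

def identical_char_dictionary(src_words, tgt_words):
    # Words spelled identically in both vocabularies serve as seed anchor pairs.
    shared = set(src_words) & set(tgt_words)
    return [(w, w) for w in shared]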

The unsupervised way: adversarial training and refinement (CPU|GPU)

To learn a mapping using adversarial training and iterative Procrustes refinement, run:

python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5

By default, the validation metric is the mean cosine of word pairs from a synthetic dictionary built with CSLS (Cross-domain Similarity Local Scaling). For some language pairs (e.g. En-Zh), we recommend centering the embeddings using --normalize_embeddings center.
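
For reference, CSLS rescales plain cosine similarity by penalizing words that sit in dense "hub" regions; a minimal sketch of the scoring formula from the paper (assuming L2-normalized, row-wise embedding matrices; not the library's implementation):

import numpy as np

def csls(src_emb, tgt_emb, k=10):
    # src_emb: (n, d) mapped source vectors, tgt_emb: (m, d) target vectors, L2-normalized.
    # Returns an (n, m) matrix of CSLS(x, y) = 2*cos(x, y) - r_tgt(x) - r_src(y).
    cos = src_emb @ tgt_emb.T
    r_tgt = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # mean sim of each source word to its k NN targets
    r_src = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # mean sim of each target word to its k NN sources
    return 2 * cos - r_tgt[:, None] - r_src[None, :]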

Evaluate monolingual or cross-lingual embeddings (CPU|GPU)

We also include a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:

Monolingual

python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000

Cross-lingual

python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000

Word embedding format

By default, the aligned embeddings are exported to a text format at the end of experiments: --export txt. Exporting embeddings to a text file can take a while if you have a lot of embeddings. For a very fast export, you can set --export pth to export the embeddings in a PyTorch binary file, or simply disable the export (--export "").

When loading embeddings, the model can load:

  • PyTorch binary files previously generated by MUSE (.pth files)
  • fastText binary files previously generated by fastText (.bin files)
  • text files (text file with one word embedding per line)

The first two options are very fast and can load 1 million embeddings in a few seconds, while loading text files can take a while.
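
For reference, reading the exported text format back is straightforward; a minimal sketch (assuming the usual fastText-style layout: a "num_words dim" header line, then one word and its vector per line):

import numpy as np

def load_txt_embeddings(path, max_vocab=200000):
    words, vectors = [], []
    with open(path, "r", encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())  # header line
        for i, line in enumerate(f):
            if i >= max_vocab:
                break
            word, vec = line.rstrip().split(" ", 1)
            words.append(word)
            vectors.append(np.array(vec.split(), dtype=np.float32))
    return words, np.vstack(vectors)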

Download

We provide multilingual embeddings and ground-truth bilingual dictionaries. These embeddings are fastText embeddings that have been aligned in a common space.

Multilingual word Embeddings

We release fastText Wikipedia supervised word embeddings for 30 languages, aligned in a single vector space.

Arabic: text Bulgarian: text Catalan: text Croatian: text Czech: text Danish: text
Dutch: text English: text Estonian: text Finnish: text French: text German: text
Greek: text Hebrew: text Hungarian: text Indonesian: text Italian: text Macedonian: text
Norwegian: text Polish: text Portuguese: text Romanian: text Russian: text Slovak: text
Slovenian: text Spanish: text Swedish: text Turkish: text Ukrainian: text Vietnamese: text

You can visualize cross-lingual nearest neighbors using demo.ipynb.
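
The notebook essentially performs nearest-neighbor queries in the shared space; a minimal sketch of that kind of lookup (illustrative, assuming word lists and row-wise embedding matrices loaded as above):

import numpy as np

def crosslingual_neighbors(word, src_words, src_emb, tgt_words, tgt_emb, k=5):
    # Return the k target-language words closest (by cosine) to a source word.
    x = src_emb[src_words.index(word)]
    x = x / np.linalg.norm(x)
    tgt_norm = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    scores = tgt_norm @ x
    return [tgt_words[i] for i in np.argsort(-scores)[:k]]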

Ground-truth bilingual dictionaries

We created 110 large-scale ground-truth bilingual dictionaries using an internal translation tool. The dictionaries handle the polysemy of words well. We provide a train and test split of 5000 and 1500 unique source words, as well as a larger set of up to 100k pairs. Our goal is to ease the development and evaluation of cross-lingual word embeddings and multilingual NLP.
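
Each dictionary file simply lists one whitespace-separated "source_word target_word" pair per line (a source word may appear with several translations); a minimal loading sketch, under that assumption:

def load_dictionary(path):
    # One "source_word target_word" pair per line; a source word may repeat
    # with different translations (polysemy).
    pairs = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            src_word, tgt_word = line.rstrip().split()
            pairs.append((src_word, tgt_word))
    return pairs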

European languages in every direction

src-tgt German English Spanish French Italian Portuguese
German - full train test full train test full train test full train test full train test
English full train test - full train test full train test full train test full train test
Spanish full train test full train test - full train test full train test full train test
French full train test full train test full train test - full train test full train test
Italian full train test full train test full train test full train test - full train test
Portuguese full train test full train test full train test full train test full train test -

Other languages to English (e.g. {fr,es}-en)

Afrikaans: full train test Albanian: full train test Arabic: full train test Bengali: full train test
Bosnian: full train test Bulgarian: full train test Catalan: full train test Chinese: full train test
Croatian: full train test Czech: full train test Danish: full train test Dutch: full train test
English: full train test Estonian: full train test Filipino: full train test Finnish: full train test
French: full train test German: full train test Greek: full train test Hebrew: full train test
Hindi: full train test Hungarian: full train test Indonesian: full train test Italian: full train test
Japanese: full train test Korean: full train test Latvian: full train test Lithuanian: full train test
Macedonian: full train test Malay: full train test Norwegian: full train test Persian: full train test
Polish: full train test Portuguese: full train test Romanian: full train test Russian: full train test
Slovak: full train test Slovenian: full train test Spanish: full train test Swedish: full train test
Tamil: full train test Thai: full train test Turkish: full train test Ukrainian: full train test
Vietnamese: full train test

English to other languages (e.g. en-{fr,es})

Afrikaans: full train test Albanian: full train test Arabic: full train test Bengali: full train test
Bosnian: full train test Bulgarian: full train test Catalan: full train test Chinese: full train test
Croatian: full train test Czech: full train test Danish: full train test Dutch: full train test
English: full train test Estonian: full train test Filipino: full train test Finnish: full train test
French: full train test German: full train test Greek: full train test Hebrew: full train test
Hindi: full train test Hungarian: full train test Indonesian: full train test Italian: full train test
Japanese: full train test Korean: full train test Latvian: full train test Lithuanian: full train test
Macedonian: full train test Malay: full train test Norwegian: full train test Persian: full train test
Polish: full train test Portuguese: full train test Romanian: full train test Russian: full train test
Slovak: full train test Slovenian: full train test Spanish: full train test Swedish: full train test
Tamil: full train test Thai: full train test Turkish: full train test Ukrainian: full train test
Vietnamese: full train test

References

Please cite [1] if you found the resources in this repository useful.

Word Translation Without Parallel Data

[1] A. Conneau*, G. Lample*, L. Denoyer, MA. Ranzato, H. Jégou, Word Translation Without Parallel Data

* Equal contribution. Order has been determined with a coin flip.

@article{conneau2017word,
  title={Word Translation Without Parallel Data},
  author={Conneau, Alexis and Lample, Guillaume and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1710.04087},
  year={2017}
}

MUSE is the project at the origin of the work on unsupervised machine translation with monolingual data only [2].

Unsupervised Machine Translation With Monolingual Data Only

[2] G. Lample, A. Conneau, L. Denoyer, MA. Ranzato, Unsupervised Machine Translation With Monolingual Data Only

@article{lample2017unsupervised,
  title={Unsupervised Machine Translation Using Monolingual Corpora Only},
  author={Lample, Guillaume and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1711.00043},
  year={2017}
}

Related work

Contact: [email protected] [email protected]

Comments
  • Use of Validation Dictionary during Unsupervised Training

    Hello - I have been training MUSE embeddings for a number of low-resource languages and I discovered that the model is being iteratively validated using an internal dictionary, even in the unsupervised case. I discovered this by coincidence when training models for Uyghur and Tigrinya, which do not have any 'pre-trained' dictionaries, and I got an error message from the evaluator, saying that it could not find the dictionary under: data/crosslingual/dictionaries/en-.5000-6500.txt

    I also tried uncommenting lines 217 and 219 under src/evaluation/evaluator.py, but that gave me another error from the trainer. Could you advise on what the error means?

    File "unsupervised.py", line 143, in trainer.save_best(to_log, VALIDATION_METRIC) File "/proj/nlpdisk3/nlpusers/noura/deep-learning/Experiments/Embeddings/MUSE/src/trainer.py", line 224, in save_best if to_log[metric] > self.best_valid_metric: KeyError: 'mean_cosine-csls_knn_10-S2T-10000'

    I imagine that if I created a dummy dictionary file, the same thing would happen.

    Thank you, Noura

    opened by narnoura 16
  • ValueError: could not convert string to float: 'encoding="utf-8"?>'

    Hi, I am a beginner with MUSE. I tried unsupervised training using Japanese and English pre-trained word vectors. For Japanese, I cleaned a collection of Japanese text with MeCab and trained fastText embeddings (300d). For English, I took the pre-trained word vectors crawl-300d-2M.vec.zip (2 million word vectors trained on Common Crawl, 600B tokens) from fastText. Here is the command I used to train the model in a GPU environment:

    CUDA_VISIBLE_DEVICES=1,2 python unsupervised.py --src_lang ja --tgt_lang en --src_emb /item_embdd/skipgram/allgenre_model.vec --tgt_emb /pretrained_vec/en/crawl-300d-2M.vec 2> error20190214a.txt

    I got the error messages below:

    Traceback (most recent call last):
      File "unsupervised.py", line 139, in <module>
        evaluator.all_eval(to_log)
      File "/multi_embedd/MUSE/src/evaluation/evaluator.py", line 215, in all_eval
        self.monolingual_wordsim(to_log)
      File "/multi_embedd/MUSE/src/evaluation/evaluator.py", line 49, in monolingual_wordsim
        ) if self.params.tgt_lang else None
      File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 105, in get_wordsim_scores
        coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
      File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 69, in get_spearman_rho
        word_pairs = get_word_pairs(path)
      File "/multi_embedd/MUSE/src/evaluation/wordsim.py", line 39, in get_word_pairs
        word_pairs.append((line[0], line[1], float(line[2])))
    ValueError: could not convert string to float: 'encoding="utf-8"?>'

    Could anyone give me advice or comments? Thanks in advance.

    opened by learnercat 15
  • Average time to align monolingual word embeddings: the supervised way?

    I am aligning English and Hindi fastText monolingual embeddings using the supervised way on a GPU. Are there any time estimates for how long it takes? It's been 4 hours, and it is still in the first refinement step.

    I ran the following command:

    python supervised.py --src_lang en --tgt_lang hi --src_emb wiki.en.vec --tgt_emb wiki.hi.vec --n_iter 5 --dico_train default
    

    Update: it was running for close to 20 hours on a GeForce GTX 1080, constantly hogging 1 CPU core, but no entries were added to the log. I am running it again.

    Log:

    INFO - 12/27/17 17:57:14 - 0:00:00 - ============ Initialized logger ============
    INFO - 12/27/17 17:57:14 - 0:00:00 - cuda: True
                                         dico_build: S2T&T2S
                                         dico_max_rank: 10000
                                         dico_max_size: 0
                                         dico_method: csls_knn_10
                                         dico_min_size: 0
                                         dico_threshold: 0
                                         dico_train: default
                                         emb_dim: 300
                                         exp_path: /MUSE/dumped/hidden
                                         export: True
                                         max_vocab: 200000
                                         n_iters: 5
                                         normalize_embeddings: 
                                         seed: -1
                                         src_emb:wiki.en.vec
                                         src_lang: en
                                         tgt_emb: wiki.hi.vec
                                         tgt_lang: hi
                                         verbose: 2
    INFO - 12/27/17 17:57:14 - 0:00:00 - The experiment will be stored in hidden/MUSE/dumped/hidden
    INFO - 12/27/17 17:57:25 - 0:00:11 - Loaded 200000 pre-trained word embeddings
    INFO - 12/27/17 17:57:45 - 0:00:31 - Loaded 158016 pre-trained word embeddings
    INFO - 12/27/17 17:57:49 - 0:00:34 - Found 8704 pairs of words in the dictionary (4998 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
    INFO - 12/27/17 17:57:49 - 0:00:34 - Starting refinement iteration 0...
    INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
    INFO - 12/27/17 17:57:49 - 0:00:35 -                        Dataset      Found     Not found          Rho
    INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
    INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MTurk-771        771             0       0.6689
    INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MTurk-287        286             1       0.6773
    INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_SIMLEX-999        998             1       0.3823
    INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-REL        252             0       0.6820
    INFO - 12/27/17 17:57:49 - 0:00:35 -                 EN_RW-STANFORD       1323           711       0.5080
    INFO - 12/27/17 17:57:49 - 0:00:35 -                       EN_MC-30         30             0       0.8123
    INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-ALL        353             0       0.7388
    INFO - 12/27/17 17:57:49 - 0:00:35 -                    EN_VERB-143        144             0       0.3973
    INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_MEN-TR-3k       3000             0       0.7637
    INFO - 12/27/17 17:57:49 - 0:00:35 -                      EN_YP-130        130             0       0.5333
    INFO - 12/27/17 17:57:49 - 0:00:35 -                       EN_RG-65         65             0       0.7974
    INFO - 12/27/17 17:57:49 - 0:00:35 -                   EN_SEMEVAL17        379             9       0.7216
    INFO - 12/27/17 17:57:49 - 0:00:35 -                  EN_WS-353-SIM        203             0       0.7811
    INFO - 12/27/17 17:57:49 - 0:00:35 - ====================================================================
    INFO - 12/27/17 17:57:49 - 0:00:35 - Monolingual source word similarity score average: 0.65108
    INFO - 12/27/17 17:57:49 - 0:00:35 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
    INFO - 12/27/17 17:57:50 - 0:00:36 - 1500 source words - nn - Precision at k = 1: 23.800000
    INFO - 12/27/17 17:57:51 - 0:00:36 - 1500 source words - nn - Precision at k = 5: 41.133333
    INFO - 12/27/17 17:57:51 - 0:00:37 - 1500 source words - nn - Precision at k = 10: 48.133333
    INFO - 12/27/17 17:57:51 - 0:00:37 - Found 2032 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
    
    
    opened by gvishal 12
  • Hindi not visible

    In the "Other languages to English" and "English to other languages" dictionaries, I am seeing strange characters in place of the Hindi words, whereas English is displayed just fine. Is it some fault on my side?

    opened by hiteshn97 11
  • When I tried supervised.py, I got "ValueError: The input must have at least 3 entries!"

    Hello, I used Docker to build an environment containing conda, PyTorch and Faiss with Python 3.6, and I was finally able to run this amazing open-source project.

    But when I tried the following command to test the supervised method:

    python3 supervised.py --src_lang en --tgt_lang es --src_emb ../wiki.en.vec --tgt_emb ../Spanish_wiki.es.vec --n_iter 5 --dico_train identical_char --cuda False

    It returned a ValueError saying "The input must have at least 3 entries!".

    Here are the logs:

    Failed to load GPU Faiss: No module named 'swigfaiss_gpu'
    Faiss falling back to CPU-only.
    Impossible to import Faiss-GPU. Switching to FAISS-CPU, this will be slower.

    INFO - 01/06/18 12:20:38 - 0:00:00 - ============ Initialized logger ============
    INFO - 01/06/18 12:20:38 - 0:00:00 - cuda: False
                                         dico_build: S2T&T2S
                                         dico_max_rank: 10000
                                         dico_max_size: 0
                                         dico_method: csls_knn_10
                                         dico_min_size: 0
                                         dico_threshold: 0
                                         dico_train: identical_char
                                         emb_dim: 300
                                         exp_path: /Documents/MUSE-master/dumped/nrthsd26ay
                                         export: True
                                         max_vocab: 200000
                                         n_iters: 5
                                         normalize_embeddings:
                                         seed: -1
                                         src_emb: ../wiki.en.vec
                                         src_lang: en
                                         tgt_emb: ../Spanish_wiki.es.vec
                                         tgt_lang: es
                                         verbose: 2
    INFO - 01/06/18 12:20:38 - 0:00:00 - The experiment will be stored in /Documents/MUSE-master/dumped/nrthsd26ay
    INFO - 01/06/18 12:20:48 - 0:00:10 - Loaded 200000 pre-trained word embeddings
    INFO - 01/06/18 12:21:02 - 0:00:24 - Loaded 200000 pre-trained word embeddings
    INFO - 01/06/18 12:21:04 - 0:00:26 - Found 85912 pairs of identical character strings.
    INFO - 01/06/18 12:21:05 - 0:00:26 - Starting refinement iteration 0...
    INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
    INFO - 01/06/18 12:22:14 - 0:01:36 - Dataset Found Not found Rho
    INFO - 01/06/18 12:22:14 - 0:01:36 - ====================================================================
    here: n--> 771 m--> 771
    Traceback (most recent call last):
      File "supervised.py", line 92, in evaluator.all_eval(to_log)
      File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 188, in all_eval
        self.monolingual_wordsim(to_log)
      File "/Documents/MUSE-master/src/evaluation/evaluator.py", line 43, in monolingual_wordsim
        self.mapping(self.src_emb.weight).data.cpu().numpy()
      File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 104, in get_wordsim_scores
        coeff, found, not_found = get_spearman_rho(word2id, embeddings, filepath, lower)
      File "/Documents/MUSE-master/src/evaluation/wordsim.py", line 83, in get_spearman_rho
        return spearmanr(gold, pred).correlation, len(gold), not_found
      File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/stats.py", line 3301, in spearmanr
        rho, pval = mstats_basic.spearmanr(a, b, axis)
      File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/scipy/stats/mstats_basic.py", line 461, in spearmanr
        raise ValueError("The input must have at least 3 entries!")
    ValueError: The input must have at least 3 entries!

    Does anyone have any ideas about this problem? Thanks ^^

    opened by miscy210 11
  • Results obtained are different from those published in the paper

    Here are my settings below; the rest of the parameters remain at their defaults. These word vectors and the zh-en dictionary were downloaded from the official site.

    export SRC_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.zh.vec 
    export TGT_EMB=/home/jack/dev1.8t/corpus/zh-en/wiki.en.vec
    nohup python unsupervised.py --src_lang zh --tgt_lang en --src_emb $SRC_EMB --tgt_emb $TGT_EMB --cuda 1 --export 1 --exp_path ./dumped/unsuperv/zh-mn --emb_dim 300 --refinement true --adversarial true > zh-en-unsuper.log &
    

    but the results are just 0s:

    east one unknown word (0 in lang1, 0 in lang2)
    INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 1: 0.000000
    INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 5: 0.000000
    INFO - 04/24/18 10:03:34 - 0:08:39 - 1500 source words - nn - Precision at k = 10: 0.000000
    INFO - 04/24/18 10:03:34 - 0:08:39 - Found 2483 pairs of words in the dictionary (1500 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
    INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 1: 0.000000
    INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 5: 0.000000
    INFO - 04/24/18 10:03:44 - 0:08:49 - 1500 source words - csls_knn_10 - Precision at k = 10: 0.000000
    
    
    opened by yudianer 9
  • Using --dico_train identical_char still needs dictionaries

    According to the docs:

    when set to "identical_char" it will use identical character strings between source and target languages to form a vocabulary.````
    

    I understood that the dictionary was going to be created using the given corpus

    opened by DavidGOrtega 9
  • Reproducing the EN-ZH results in Table 1

    Hi,

    I tried training MUSE in the unsupervised way with the pretrained fastText Wikipedia embeddings. On some European language pairs, such as EN-DE or EN-ES, I was able to get reasonable performance using the default parameters. However, for EN-ZH or ZH-EN, using the default parameters, the cross-lingual word similarity scores are always 0 (even for top 10).

    As a comparison, to rule out problems with the data, I ran the supervised setting for EN-ZH, and it gave non-zero performance (though the number is a few points lower than that in the paper).

    Any idea of what I might have done wrong? Thank you.

    opened by ccsasuke 9
  • Understanding the output of training

    After training English-Hindi, I get the following files in dumped/debug/xohu3xpdfn:

    • best_mapping.pth
    • vectors-en.txt
    • vectors-hi.txt
    • params.pkl
    • train.log

    Do the .txt files contain the mapped vectors? If not, how can I obtain the mapping?

    opened by euler16 8
  • How can I do the translation task?

    This might be dumb. I read the paper and the git repo. Could you briefly tell me, at a high level, how I can do the translation task, given src embeddings and target embeddings?

    I understand I can do src_word -> src_embedding -> matrix transform into the target embedding space. Then how do I retrieve the target word?

    thanks!

    opened by ecilay 8
  • 2909: RuntimeWarning: Mean of empty slice.

    I tried unsupervised.py with wiki.en.vec and wiki.ja.vec pretrained from fastText:

    python unsupervised.py --src_lang en --tgt_lang ja --src_emb wiki.en.vec --tgt_emb wiki.ja.vec --n_refinement 5 --cuda 1 --exp_path vec --dico_eval en-ja.5000-6500.txt --normalize_embeddings center

    but I get some warnings and precision at k = 0:

    INFO - 06/06/18 17:59:04 - 0:10:32 - 996000 - Discriminator loss: 0.3396 - 3339 samples/s
    INFO - 06/06/18 17:59:07 - 0:10:35 - ====================================================================
    INFO - 06/06/18 17:59:07 - 0:10:35 - Dataset Found Not found Rho
    INFO - 06/06/18 17:59:07 - 0:10:35 - ====================================================================
    INFO - 06/06/18 17:59:07 - 0:10:35 - ====================================================================
    /***.pyenv/versions/3.6.1/lib/python3.6/site-packages/numpy/core/fromnumeric.py:2909: RuntimeWarning: Mean of empty slice.
      out=out, **kwargs)
    /***/.pyenv/versions/3.6.1/lib/python3.6/site-packages/numpy/core/_methods.py:80: RuntimeWarning: invalid value encountered in double_scalars
      ret = ret.dtype.type(ret / rcount)
    INFO - 06/06/18 17:59:07 - 0:10:35 - Monolingual source word similarity score average: nan
    INFO - 06/06/18 17:59:07 - 0:10:35 - Found 1799 pairs of words in the dictionary (1459 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
    INFO - 06/06/18 17:59:07 - 0:10:35 - 1459 source words - nn - Precision at k = 1: 0.000000
    INFO - 06/06/18 17:59:08 - 0:10:35 - 1459 source words - nn - Precision at k = 5: 0.000000
    INFO - 06/06/18 17:59:08 - 0:10:35 - 1459 source words - nn - Precision at k = 10: 0.000000
    INFO - 06/06/18 17:59:08 - 0:10:35 - Found 1799 pairs of words in the dictionary (1459 unique). 0 other pairs contained at least one unknown word (0 in lang1, 0 in lang2)
    INFO - 06/06/18 17:59:38 - 0:11:06 - 1459 source words - csls_knn_10 - Precision at k = 1: 0.000000
    INFO - 06/06/18 17:59:38 - 0:11:06 - 1459 source words - csls_knn_10 - Precision at k = 5: 0.068540
    INFO - 06/06/18 17:59:39 - 0:11:06 - 1459 source words - csls_knn_10 - Precision at k = 10: 0.068540

    opened by ghost 8
  • self-mapped english words in dictionaries

    Hi, I've checked the en-ms, en-id, en-ja, etc. files, and it seems that in many cases the English word has been mapped to itself and there is no translation. Is there any reason for that? To use the dictionaries, can we simply discard those pairs?

    opened by Sara-Rajaee 0
  • Tried on GloVe?

    I have managed to replicate the results from the paper for English and German fastText. For reference, I am interested in the cross-lingual word similarity task. The results I got (reporting Spearman correlation) are: Original fastText: 9%, Mapped fastText: 71%.

    However, I tried the same code on English and German GloVe embeddings and did not get much improvement. Results: Original GloVe: 1%, Mapped GloVe: 3%.

    Any idea why this might be the case?

    opened by Wafaa014 0
  • added a 'node' script to compress models up to 1/10

    Why: reading model data at rest and transferring it over the network is tedious.

    I used a simple NodeJS script relying on protobufjs to compress the "Multilingual word Embeddings" models. If anyone is good at packaging, it would be better not to rely on an npm/yarn package but on a single script instead, maybe in Python since the whole project is in Python.

    CLA Signed 
    opened by bacloud23 6
  • [ML Question] Is it possible somehow to translate two or three words?

    Can anyone please tell me whether translating very small sentences could be achieved somehow? What I mean is not translating the exact sentence; I think that would be beyond scope, because sentences contain tense, prepositions and so on, which are very ambiguous.

    What I want to say is: is it possible somehow to translate, for instance, "bat"? The "bat" in "baseball bat" is so different from the one in "bat wings"; in this situation the context could probably help, but I'm not sure how, or even whether this is possible using these models.

    Any help, or a hint on how to achieve this is really appreciated. Thanks

    opened by bacloud23 0
  • Reduce memory usage on loading embedding from txt

    The original implementation of read_txt_embeddings takes a lot of memory. For example, to load an embedding txt file containing a vocabulary of 2,000,000 words with embedding dimension 300, the vectors list takes 2,000,000 × 300 × 8 bytes ≈ 4.8 GB, np.concatenate takes another 4.8 GB, and torch.from_numpy takes 2.4 GB, so in total it takes around 12 GB. Knowing vocab_size in advance and setting the dtype of the vectors to np.float32, the memory requirement can be reduced to around 2.4 GB instead of 12 GB.
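
    A minimal sketch of the preallocation idea described above (illustrative, not the exact code in the PR):

    import numpy as np
    import torch

    def read_txt_embeddings_lowmem(path):
        with open(path, "r", encoding="utf-8") as f:
            n_words, dim = map(int, f.readline().split())
            # Preallocate a single float32 array instead of growing a Python list
            # of float64 rows and concatenating it at the end.
            emb = np.empty((n_words, dim), dtype=np.float32)
            words = []
            for i, line in enumerate(f):
                word, vec = line.rstrip().split(" ", 1)
                words.append(word)
                emb[i] = np.array(vec.split(), dtype=np.float32)
        return words, torch.from_numpy(emb)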

    CLA Signed 
    opened by yeyinthtoon 2
Owner
Facebook Research