**Codebase and data are still being uploaded.**
VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.
What's New:
- July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
- July 2021: Support subword-nmt tokenization.
- July 2021: Support sentencepiece tokenization.
What's On-going:
- Add translation training/evaluation codes.
- Support classification tasks.
- Support pip usage.
Features:
- Efficient: CPU learning on one machine.
- Simple: The core code is no more than 200 lines.
- Easy-to-use: Supports the widely-used tokenization toolkits subword-nmt and sentencepiece.
- Flexible: Users can customize their own tokenization rules.
Requirements and Installation
The required environment:
- python 3
- tqdm
- mosesdecoder
- subword-nmt
To use VOLT and develop locally:

```
git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm
```
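As an optional sanity check (not part of the official setup), you can verify that the Python dependencies installed above are importable:

```
import sentencepiece
import tqdm

# Print versions to confirm the packages installed above can be loaded.
print("sentencepiece", sentencepiece.__version__)
print("tqdm", tqdm.__version__)
```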
Usage
The first step is to get vocabulary candidates and tokenized texts. The subword vocabulary can be generated by subword-nmt and sentencepiece. Here are two examples:
```
# Assume source_data is the file storing data in the source language
# Assume target_data is the file storing data in the target language
BPEROOT=subword-nmt
size=30000 # the size of BPE
cat source_data > training_data
cat target_data >> training_data

# subword-nmt style:
mkdir bpeoutput
BPE_CODE=code # the path to save vocabulary
python3 $BPEROOT/learn_bpe.py -s $size < training_data > $BPE_CODE
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_file > bpeoutput/source.file
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_file > bpeoutput/target.file

# sentencepiece style:
mkdir spmout
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
# After this step, you will see spm.vocab and spm.model
python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
```
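If you prefer to stay in Python rather than call the spm_train.py/spm_encoder.py scripts, the sentencepiece Python API can produce the same candidates. This is only a sketch under the file layout assumed above (training_data, source_data, spmout/), not part of the VOLT scripts:

```
import sentencepiece as spm

# Train a BPE model, mirroring the spm_train.py call above.
spm.SentencePieceTrainer.Train(
    input="training_data", model_prefix="spm",
    vocab_size=30000, character_coverage=1.0, model_type="bpe",
)

# Encode the source side into pieces, mirroring spm_encoder.py --output_format piece.
sp = spm.SentencePieceProcessor(model_file="spm.model")
with open("source_data") as fin, open("spmout/source_data", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```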
The second step is to run the VOLT script. It accepts the following parameters:
- --source_file: the file storing data in the source language.
- --target_file: the file storing data in the target language.
- --token_candidate_file: the file storing token candidates.
- --max_number: the maximum size of the vocabulary generated by VOLT.
- --interval: the search granularity in VOLT.
- --loop_in_ot: the maximum number of iterations in the Sinkhorn solver.
- --tokenizer: the toolkit used to get the vocabulary candidates. Only subword-nmt and sentencepiece are supported.
- --size_file: the file to store the vocabulary size generated by VOLT.
- --vocab_file: the file to store the vocabulary generated by VOLT.
- --threshold: the threshold that decides which tokens from the optimal transport matrix are added to the final vocabulary. A lower threshold means that fewer token candidates are dropped. Both --loop_in_ot and --threshold are illustrated in the sketch after the example commands below.
```
# subword-nmt style
python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
  --token_candidate_file $BPE_CODE \
  --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size

# sentencepiece style
python3 ../ot_run.py --source_file spmout/source_data --target_file spmout/target_data \
  --token_candidate_file spm.vocab \
  --vocab_file spmout/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer sentencepiece --size_file spmout/size
```
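To make the --loop_in_ot and --threshold knobs concrete, here is a minimal, self-contained sketch of entropy-regularized Sinkhorn iterations followed by threshold-based selection. It is illustrative only and does not reproduce ot_run.py; the function names, the toy cost matrix, and the uniform marginals are all assumptions made for the example.

```
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, reg=0.1, loop_in_ot=500):
    """Entropy-regularized Sinkhorn iterations; returns an approximate transport plan."""
    K = np.exp(-cost / reg)           # Gibbs kernel
    u = np.ones_like(row_marginal)
    for _ in range(loop_in_ot):       # --loop_in_ot bounds this loop in spirit
        v = col_marginal / (K.T @ u)  # rescale columns to match the column marginal
        u = row_marginal / (K @ v)    # rescale rows to match the row marginal
    return u[:, None] * K * v[None, :]

# Toy problem: 5 candidate tokens (rows) vs 4 characters (columns).
rng = np.random.default_rng(0)
cost = rng.random((5, 4))
tokens = np.full(5, 1 / 5)            # uniform token marginal
chars = np.full(4, 1 / 4)             # uniform character marginal
plan = sinkhorn(cost, tokens, chars)

# --threshold in spirit: keep candidates that receive enough transported mass.
threshold = 1e-3
mass = plan.sum(axis=1)
kept = [i for i, m in enumerate(mass) if m > threshold]
print("kept candidate indices:", kept)
```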
The third step is to use the generated vocabulary to tokenize your texts:
```
# for the subword-nmt toolkit
python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_file > bpeoutput/source.file
python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_file > bpeoutput/target.file

# for the sentencepiece toolkit, here we only keep the optimal size
best_size=$(cat spmout/size)
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
# After this step, you will see spm.vocab and spm.model
python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
```
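For the subword-nmt side, the same application step can also be driven from Python. This sketch assumes the pip-installable subword-nmt package (pip3 install subword-nmt), which is an extra dependency beyond the clone above, and reuses the paths from the commands:

```
import codecs
from subword_nmt.apply_bpe import BPE

# Load the VOLT-generated vocabulary as BPE codes, as apply_bpe.py -c does.
with codecs.open("bpeoutput/vocab", encoding="utf-8") as codes:
    bpe = BPE(codes)

# Segment the raw source file with the learned vocabulary.
with open("source_file") as fin, open("bpeoutput/source.file", "w") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```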
Examples
We provide several examples under "examples/".
Datasets
The WMT-14 En-De translation data can be downloaded via the running scripts.
The TED data can be downloaded at TED.
Citation
Please cite as:
```
@inproceedings{volt,
  title     = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author    = {Jingjing Xu and Hao Zhou and Chun Gan and Zaixiang Zheng and Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year      = {2021},
}
```