**Codebase and data are still being uploaded.**
VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.
What's New:
- July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
- July 2021: Support subword-nmt tokenization.
- July 2021: Support sentencepiece tokenization.
What's On-going:
- Add translation training/evaluation codes.
- Support classification tasks.
- Support pip usage.
Features:
- Efficient: CPU learning on one machine.
- Simple: The core code is no more than 200 lines.
- Easy-to-use: Supports the widely-used tokenization toolkits subword-nmt and sentencepiece.
- Flexible: Users can customize their own tokenization rules.
Requirements and Installation
The required environment:
- python 3
- tqdm
- mosesdecoder
- subword-nmt
To use VOLT and develop locally:

```
git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm
```
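As an optional sanity check (not part of the official setup), you can verify that the Python dependencies installed above are importable:

```
import sentencepiece
import tqdm

# Print versions to confirm the packages installed above can be loaded.
print("sentencepiece", sentencepiece.__version__)
print("tqdm", tqdm.__version__)
```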
Usage
The first step is to get vocabulary candidates and tokenized texts. The subword vocabulary can be generated by subword-nmt and sentencepiece. Here are two examples:
```
# Assume source_data is the file storing data in the source language
# Assume target_data is the file storing data in the target language
BPEROOT=subword-nmt
size=30000 # the size of BPE
cat source_data > training_data
cat target_data >> training_data

# subword-nmt style:
mkdir bpeoutput
BPE_CODE=code # the path to save vocabulary
python3 $BPEROOT/learn_bpe.py -s $size < training_data > $BPE_CODE
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_file > bpeoutput/source.file
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_file > bpeoutput/target.file

# sentencepiece style:
mkdir spmout
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
# After this step, you will see spm.vocab and spm.model
python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
```
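If you prefer to stay in Python rather than call the spm_train.py/spm_encoder.py scripts, the sentencepiece Python API can produce the same candidates. This is only a sketch under the file layout assumed above (training_data, source_data, spmout/), not part of the VOLT scripts:

```
import sentencepiece as spm

# Train a BPE model, mirroring the spm_train.py call above.
spm.SentencePieceTrainer.Train(
    input="training_data", model_prefix="spm",
    vocab_size=30000, character_coverage=1.0, model_type="bpe",
)

# Encode the source side into pieces, mirroring spm_encoder.py --output_format piece.
sp = spm.SentencePieceProcessor(model_file="spm.model")
with open("source_data") as fin, open("spmout/source_data", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```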
The second step is to run the VOLT script. It accepts the following parameters:
- --source_file: the file storing data in the source language.
- --target_file: the file storing data in the target language.
- --token_candidate_file: the file storing token candidates.
- --max_number: the maximum size of the vocabulary generated by VOLT.
- --interval: the search granularity in VOLT.
- --loop_in_ot: the maximum number of iterations in the Sinkhorn solver.
- --tokenizer: the toolkit used to get the vocabulary candidates. Only subword-nmt and sentencepiece are supported.
- --size_file: the file to store the vocabulary size generated by VOLT.
- --vocab_file: the file to store the vocabulary generated by VOLT.
- --threshold: the threshold that decides which tokens from the optimal transport matrix are added to the final vocabulary. A lower threshold means that fewer token candidates are dropped. Both --loop_in_ot and --threshold are illustrated in the sketch after the example commands below.
```
# subword-nmt style
python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
  --token_candidate_file $BPE_CODE \
  --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size

# sentencepiece style
python3 ../ot_run.py --source_file spmout/source_data --target_file spmout/target_data \
  --token_candidate_file spm.vocab \
  --vocab_file spmout/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer sentencepiece --size_file spmout/size
```
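To make the --loop_in_ot and --threshold knobs concrete, here is a minimal, self-contained sketch of entropy-regularized Sinkhorn iterations followed by threshold-based selection. It is illustrative only and does not reproduce ot_run.py; the function names, the toy cost matrix, and the uniform marginals are all assumptions made for the example.

```
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, reg=0.1, loop_in_ot=500):
    """Entropy-regularized Sinkhorn iterations; returns an approximate transport plan."""
    K = np.exp(-cost / reg)           # Gibbs kernel
    u = np.ones_like(row_marginal)
    for _ in range(loop_in_ot):       # --loop_in_ot bounds this loop in spirit
        v = col_marginal / (K.T @ u)  # rescale columns to match the column marginal
        u = row_marginal / (K @ v)    # rescale rows to match the row marginal
    return u[:, None] * K * v[None, :]

# Toy problem: 5 candidate tokens (rows) vs 4 characters (columns).
rng = np.random.default_rng(0)
cost = rng.random((5, 4))
tokens = np.full(5, 1 / 5)            # uniform token marginal
chars = np.full(4, 1 / 4)             # uniform character marginal
plan = sinkhorn(cost, tokens, chars)

# --threshold in spirit: keep candidates that receive enough transported mass.
threshold = 1e-3
mass = plan.sum(axis=1)
kept = [i for i, m in enumerate(mass) if m > threshold]
print("kept candidate indices:", kept)
```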
The third step is to use the generated vocabulary to tokenize your texts:
```
# for the subword-nmt toolkit
python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_file > bpeoutput/source.file
python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_file > bpeoutput/target.file

# for the sentencepiece toolkit, here we only keep the optimal size
best_size=$(cat spmout/size)
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
# After this step, you will see spm.vocab and spm.model
python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
```
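For the subword-nmt side, the same application step can also be driven from Python. This sketch assumes the pip-installable subword-nmt package (pip3 install subword-nmt), which is an extra dependency beyond the clone above, and reuses the paths from the commands:

```
import codecs
from subword_nmt.apply_bpe import BPE

# Load the VOLT-generated vocabulary as BPE codes, as apply_bpe.py -c does.
with codecs.open("bpeoutput/vocab", encoding="utf-8") as codes:
    bpe = BPE(codes)

# Segment the raw source file with the learned vocabulary.
with open("source_file") as fin, open("bpeoutput/source.file", "w") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```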
Examples
We provide several examples under "examples/".
Datasets
The WMT-14 En-De translation data can be downloaded via the running scripts.
The TED data can be downloaded at TED.
Citation
Please cite as:
```
@inproceedings{volt,
  title     = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author    = {Jingjing Xu and Hao Zhou and Chun Gan and Zaixiang Zheng and Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year      = {2021},
}
```