Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation
This is the implementation of our paper:
Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation
Zhiwei He*, Xing Wang, Rui Wang, Shuming Shi, Zhaopeng Tu
ACL 2022 (long paper, main conference)
This code is heavily based on the original code of XLM, MASS and Deepaicode.
Dependencies
- Python 3
- PyTorch 1.7.1
  pip3 install torch==1.7.1+cu110
- fastBPE (see the build sketch after this list)
- Apex
  git clone https://github.com/NVIDIA/apex
  cd apex
  git reset --hard 0c2c6ee
  pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
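To get the fastBPE binary, a minimal build following the upstream fastBPE instructions looks like this (the output name fast is just the upstream default):
git clone https://github.com/glample/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast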
Data preparation
We prepared the data following the instructions from XLM (Section III), using their released scripts, BPE codes and vocabularies. However, our setup differs from theirs in a few ways (a rough preprocessing sketch follows this list):
- We used all available data, not just 5,000,000 sentences per language.
- For Romanian, we augmented the data with the monolingual data from WMT16.
- We removed noisy sentences:
  python3 filter_noisy_data.py --input all.en --lang en --output clean.en
- For English-German, we used the processed data provided by Kaitao Song.
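For reference, each language's monolingual data roughly goes through noise filtering, tokenization, and BPE with the released codes. The commands below are only a sketch: they assume XLM's tools/tokenize.sh and a compiled fastBPE binary named fast, and the file names (all.en, clean.en, codes) are placeholders rather than the exact ones used.
# remove noisy sentences (script from this repo)
python3 filter_noisy_data.py --input all.en --lang en --output clean.en
# tokenize with XLM's tokenizer (reads stdin, writes stdout)
cat clean.en | ./tools/tokenize.sh en > clean.en.tok
# apply the released BPE codes with fastBPE
./fast applybpe train.en.bpe clean.en.tok codes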
Since preparing the data can take a very long time, we provide the processed data for download:
Pre-trained models
We adopted the released XLM and MASS models for all language pairs. To better reproduce the MASS results on En-De, we continued pre-training the released MASS model on monolingual data for 300 epochs and selected the best checkpoint (epoch@270) by perplexity (PPL) on the validation set.
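Continued pre-training amounts to restarting MASS pre-training from the released checkpoint on the En-De monolingual data. The command below is only a sketch based on MASS's train.py interface: the paths, model-size flags and schedule values are illustrative assumptions, not the exact configuration used here.
python3 train.py \
  --exp_name mass_continue_ende \
  --data_path ./data/processed/de-en \
  --lgs 'en-de' \
  --mass_steps 'en,de' \
  --word_mass 0.5 \
  --encoder_only false \
  --emb_dim 1024 --n_layers 6 --n_heads 8 \
  --reload_model 'mass_ende.pth,mass_ende.pth' \
  --tokens_per_batch 3000 \
  --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' \
  --epoch_size 200000 \
  --max_epoch 300
# then pick the checkpoint with the lowest validation PPL (epoch@270 in our case)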
Here are the pre-trained models we used:

| Languages | XLM | MASS |
| --- | --- | --- |
| English-French | Model | Model |
| English-German | Model | Model |
| English-Romanian | Model | Model |
Model training
We provide training scripts and trained models for the UNMT baseline and for our approach with online self-training.
Training scripts
Train UNMT model with online self-training and XLM initialization:
cd scripts
sh run-xlm-unmt-st-ende.sh
Note: remember to modify the path variables in the header of the shell script.
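The variable names below are purely hypothetical placeholders for the kinds of paths such a script header typically sets; check the top of run-xlm-unmt-st-ende.sh for the real names.
# hypothetical header variables; use the names actually defined in the script
DATA_PATH=/path/to/data/processed/de-en   # binarized monolingual/validation data
PRETRAINED=/path/to/xlm_ende.pth          # pre-trained checkpoint used for initialization
DUMP_PATH=/path/to/experiments            # where checkpoints and logs are written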
Trained models
We selected the best model by BLEU score on the validation set separately for each direction. Therefore, we release both En-X and X-En models for each experiment.
| Approach | XLM | | MASS | |
| --- | --- | --- | --- | --- |
| UNMT | En-Fr | Fr-En | En-Fr | Fr-En |
| | En-De | De-En | En-De | De-En |
| | En-Ro | Ro-En | En-Ro | Ro-En |
| UNMT-ST | En-Fr | Fr-En | En-Fr | Fr-En |
| | En-De | De-En | En-De | De-En |
| | En-Ro | Ro-En | En-Ro | Ro-En |
Evaluation
Generate translations
Input sentences must use the same tokenization and BPE codes as the ones used to train the model.
cat input.en.bpe | \
python3 translate.py \
--exp_name translate \
--src_lang en --tgt_lang de \
--model_path trained_model.pth \
--output_path output.de.bpe \
--batch_size 8
Remove BPE
sed -r 's/(@@ )|(@@ ?$)//g' output.de.bpe > output.de.tok
Evaluate
BLEU_SCRIPT_PATH=src/evaluation/multi-bleu.perl
perl $BLEU_SCRIPT_PATH ref.de.tok < output.de.tok