DataSelection-NMT
Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts
Quick update: The paper was accepted on Dec 6, 2021! I will link the repository to the paper as soon as it is published.
Our Pre-trained models on Hugging Face
Systems | Link | Systems | Link |
---|---|---|---|
Top1 | Download | Top1 | Download |
Top2+Top1 | Download | Top2 | Download |
Top3+Top2+... | Download | Top3 | Download |
Top4+Top3+... | Download | Top4 | Download |
Top5+Top4+... | Download | Top5 | Download |
Top6+Top5+... | Download | Top6 | Download |
How to use
Note: we uploaded the best checkpoints of our trained models to Hugging Face (HF). Because the models were trained with OpenNMT-py, they cannot be used directly for inference on HF. To work around this, we use CTranslate2, an inference engine for Transformer models.
Follow the steps below to translate your sentences:
1. Install the Python package:
pip install --upgrade pip
pip install ctranslate2
2. Download the models from our HF repository. You can do this manually or with the following Python script:
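import requests
url = "Download Link"      # replace with the download URL of the model
model_path = "Model Path"  # replace with the local path to save the model to
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop here if the download failed
with open(model_path, "wb") as f:
    f.write(r.content)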
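Model checkpoints can be large, and the script above keeps the whole response in memory before writing it. As a minimal sketch that uses only the Python standard library, the file can instead be streamed to disk in chunks. The function name `download_file` and the `chunk_size` default are illustrative choices, not part of this repository.

```python
import shutil
import urllib.request


def download_file(url, model_path, chunk_size=1024 * 1024):
    """Stream `url` to `model_path` without holding the whole file in memory.

    `download_file` and `chunk_size` are hypothetical names for this sketch.
    """
    with urllib.request.urlopen(url) as response, open(model_path, "wb") as out:
        # copyfileobj copies the response to disk chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
    return model_path
```

It would be called with the same placeholders as the script above, e.g. `download_file("Download Link", "Model Path")`.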
3. Convert the downloaded model:
ct2-opennmt-py-converter --model_path model_path --output_dir output_directory
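If you want to convert several downloaded checkpoints in one go, the converter command above can also be launched from Python. The following is a sketch under that assumption: the helper names `build_convert_command` and `convert_model` are hypothetical, and only the two flags shown in the command above are used.

```python
import subprocess


def build_convert_command(model_path, output_dir):
    """Return the argument list for the converter command shown above."""
    return [
        "ct2-opennmt-py-converter",
        "--model_path", model_path,
        "--output_dir", output_dir,
    ]


def convert_model(model_path, output_dir):
    """Run the converter; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_convert_command(model_path, output_dir), check=True)
```

For example, `convert_model("Model Path", "output_directory")` runs the same conversion as the command line above.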
4. Translate tokenized inputs:
Note: the inputs must be tokenized with SentencePiece. You can also use the tokenized versions of the IWSLT test sets.
import ctranslate2
translator = ctranslate2.Translator("output_directory/")
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
print(results[0].hypotheses[0])  # best hypothesis, as a list of tokens
or
import ctranslate2
translator = ctranslate2.Translator("output_directory/")
# input_file and output_file are the paths of the tokenized source file and the translation output.
# batch_type is either "tokens" or "examples".
translator.translate_file(input_file, output_file, batch_type="tokens")
To customize the CTranslate2 functions, read this API document.
5. Detokenize the outputs:
Note: you need to detokenize the output with the same SentencePiece model as used in step 4.
tools/detokenize.perl -no-escape -l fr \
< output_file \
> output_file.detok
6. Remove the @@ tokens:
cat output_file.detok | sed -E 's/(@@)|(@@ )|(@@ ?$)//g' \
> output_file.detok.postprocessed
Use grep to check that the @@ tokens were removed (no output means success):
grep @@ output_file.detok.postprocessed
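The same clean-up and check can be sketched in Python, which may be convenient if the rest of your pipeline is already a Python script. This is an illustration rather than the repository's own tooling: the function names are hypothetical, and the pattern removes every `@@` marker together with the space that follows it, if any, which is what the `sed` command above is meant to do.

```python
import re

# An "@@" marker, optionally followed by the space that separates it from the
# next subword.
_MARKER = re.compile(r"@@ ?")


def remove_bpe_markers(line):
    """Join subwords by deleting the @@ markers in one line of text."""
    return _MARKER.sub("", line)


def postprocess_file(src_path, dst_path):
    """Clean every line of src_path into dst_path.

    Returns the number of output lines that still contain "@@", mirroring the
    grep check (0 means the markers were removed).
    """
    leftover = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = remove_bpe_markers(line.rstrip("\n"))
            if "@@" in cleaned:
                leftover += 1
            dst.write(cleaned + "\n")
    return leftover
```

For example, `remove_bpe_markers("Bon@@ jour le mon@@ de")` returns `"Bonjour le monde"`.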