Train OPUS-MT models
This package includes scripts for training NMT models for OPUS-MT using MarianNMT and OPUS data. More details are given in the Makefile, but the documentation still needs to be improved. The targets also require a specific environment and currently only work well on the CSC HPC cluster in Finland.
Pre-trained models
The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.
Quickstart
Setting up:
git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
cd OPUS-MT-train
git submodule update --init --recursive --remote
make install
Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):
make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release
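The three calls above share the same language settings and only make sense in that order. A minimal sketch of a wrapper that runs them in sequence and stops at the first failure is shown below. The function name run_pipeline and the MAKE_CMD variable are hypothetical helpers for illustration and are not part of OPUS-MT-train; only the SRCLANGS and TRGLANGS variables and the train, eval and release targets come from the commands above.

```shell
#!/bin/sh
# run_pipeline SRC TRG: call the train, eval and release targets in order.
# MAKE_CMD is a hypothetical override (not part of OPUS-MT-train) that
# allows a dry run, for example MAKE_CMD="echo make".
run_pipeline() {
    src_langs=$1
    trg_langs=$2
    for target in train eval release; do
        # Stop at the first failing target so that eval and release
        # never run after an unsuccessful earlier step.
        ${MAKE_CMD:-make} SRCLANGS="$src_langs" TRGLANGS="$trg_langs" "$target" || return 1
    done
}
```

After sourcing the file, run_pipeline "fi et" "da sv en" should issue the same three make calls as above. Setting MAKE_CMD="echo make" first only prints the commands, which is a cheap way to check the arguments before starting a long training job.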
More information is available in the documentation linked below.
Documentation
- Installation and setup
- Details about tasks and recipes
- Information about back-translation
- Information about fine-tuning models
- How to generate pivot-language-based translations
Tutorials
References
Please cite the following paper if you use OPUS-MT software and models:
@InProceedings{TiedemannThottingal:EAMT2020,
author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
year = {2020},
address = {Lisbon, Portugal}
}
Acknowledgements
None of this would be possible without all the great open source software, including
- GNU/Linux tools
- Marian-NMT
- eflomal
... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...
We would also like to acknowledge the support of the University of Helsinki and CSC (IT Center for Science), the funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to OPUS, the open collection of parallel corpora.