Training open neural machine translation models

Overview

Train Opus-MT models

This package includes scripts for training NMT models using MarianNMT and OPUS data for OPUS-MT. More details are given in the Makefile, but the documentation still needs to be improved. Note also that the targets require a specific environment and currently work well only on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.
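
The released models can also be loaded directly through the Hugging Face transformers library, where converted checkpoints are published under the Helsinki-NLP organization. A minimal sketch, assuming the Helsinki-NLP/opus-mt-fi-en checkpoint (any released pair works the same way):

from transformers import MarianMTModel, MarianTokenizer

# Load a converted OPUS-MT checkpoint from the Hugging Face hub
model_name = "Helsinki-NLP/opus-mt-fi-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate a Finnish sentence into English
batch = tokenizer(["Tämä on testi."], return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))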

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
git submodule update --init --recursive --remote
make install

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software including

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki, CSC (the Finnish IT Center for Science), funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to OPUS, the open collection of parallel corpora.

Comments
  • Word Alignment Files

    Hi, I'm retraining the existing id-en model with my own training data. To train the model, the Makefile passes the --guided-alignment parameter together with the path to a word-alignment file, but that file is not present in the pre-trained models. Can you share that file?

    Thanks.

    opened by katphlab 7
  • How to get vocab.yml file when doing train->eval->dist

    Hey,

    First of all I wanted to thank you for this amazing project.

    I followed the instructions in the repo, set up the environment correctly, and can run train and eval as instructed without any problems. Release does not work for me, so I tried dist instead, which did work and packaged the model into a zip. However, the Hugging Face script for converting Marian models to PyTorch requires a vocab.yml file, which is present in all the pre-trained OPUS-MT models but not in my zip file - I only have src.vocab and trg.vocab files.

    Could you please explain to me how to get the vocab.yml file, and whether it is done using any make commands or manually?

    Thanks, Best, Oren

    opened by orendar 6
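
    A rough, hedged sketch for the question above (this is not the project's own release script): assuming src.vocab and trg.vocab are SentencePiece vocabulary listings with one token<TAB>score entry per line, a joint vocab.yml-style token-to-id map can be assembled as follows; verify the file format and the resulting mapping against a released model before relying on it.

    import yaml  # PyYAML

    def read_tokens(path):
        # Assumption: one "token<TAB>score" entry per line (SentencePiece .vocab format)
        with open(path, encoding="utf-8") as fh:
            return [line.rstrip("\n").split("\t")[0] for line in fh if line.strip()]

    joint = {}
    for token in read_tokens("src.vocab") + read_tokens("trg.vocab"):
        joint.setdefault(token, len(joint))

    with open("vocab.yml", "w", encoding="utf-8") as fh:
        yaml.safe_dump(joint, fh, allow_unicode=True, default_flow_style=False)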
  • [Language Codes] How are models named?

    For example, in cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de, is there a table or some other source for what zh_HK, zh_yue, yue, etc. represent? Is zh_yue different from yue? Is zh_cn different from cn somehow?

    Thanks in advance!

    opened by sshleifer 6
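
    This does not decode the zh_* variants themselves, but as an illustrative check: models with multiple target languages store the accepted target codes as >>code<< tokens in their vocabulary, so they can be listed from a converted checkpoint. A sketch, using the Helsinki-NLP/opus-mt-en-ROMANCE checkpoint purely as an example:

    from transformers import MarianTokenizer

    # Single-target models typically carry no >>code<< tokens at all
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ROMANCE")
    lang_tokens = sorted(t for t in tokenizer.get_vocab()
                         if t.startswith(">>") and t.endswith("<<"))
    print(lang_tokens)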
  • Problem Fine-tuning Models using TMX files

    Hi! I have been working with the OPUS-MT models available through Hugging Face and was hoping to fine-tune them using TMX files. At this point, I have been able to install the listed prerequisites and follow the documentation, but the tmx-tune recipe is failing around the ${TUNED_MODEL}.done target starting around line 520. The LOADMODS environment variable is empty, which is causing the failure. The repo contains no documentation mentioning the value of LOADMODS and it is not set anywhere, so I am lost as to what it should be.

    More information (not sure if this is related/helpful to answer the question):

    • I have created a Docker container to set up this environment using a CUDA-based Ubuntu image; currently I am not using a GPU for testing, but I do have access to one if needed
    • To get to this point, I have had to move files/directories around (i.e. moving the tools directory from OPUS-MT-train to OPUS-MT-train/finetune and running the Makefile from that directory)
    • I have tried arbitrarily setting LOADMODS in the make tmx-tune command and the recipe still fails on the ${TUNED_MODEL}.done target because ${TRAIN_SRC}.pre.gz is empty
    opened by hdeval1 5
  • Chinese-English model?

    Thank you for this great resource, it's a really impressive collection of models.

    I've noticed there are models for zh->fi, zh->sv and zh->de, but no model for zh->en (or en->zh, for that matter). Since these are quite prominent language pairs, I'm wondering whether there are plans to add them in the future? Or am I just looking in the wrong place?

    opened by ales-t 5
  • Using pretrained models for translations

    Hello, I have a question regarding the use of the released pre-trained models.

    I have a marian server running the opus en-fr model (BPE). I'd like to test the model by translating some sentences of my choice.

    According to the model documentation, I have to send preprocessed input to the server. The usage of the file preprocess.sh is:

    USAGE preprocess.sh langid bpecodes < input > output

    While langid, input, and output are clear to me, I don't understand what I should pass as bpecodes. Can you please point me in the right direction?

    Thanks in advance

    opened by MickHardins 4
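
    For the question above: bpecodes should be the BPE merge-codes file shipped in the BPE-based model package. As an illustration only (not the project's preprocess.sh itself), a hedged sketch applying such codes with the subword-nmt library; the file name source.bpe is hypothetical, so check the downloaded package for the actual name.

    import codecs
    from subword_nmt.apply_bpe import BPE

    # "source.bpe" is a placeholder for the BPE codes file found in the model package
    with codecs.open("source.bpe", encoding="utf-8") as codes:
        bpe = BPE(codes)

    line = "this is a tokenized , lowercased sentence"
    print(bpe.process_line(line))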
  • en-es / es-en : spm instead of bpe?

    Hi,

    Do you have SentencePiece (spm) versions of the tokenization for the es-en / en-es models, since source and target spm models are required to convert the models into PyTorch?

    Thank you.

    opened by pentegroom 4
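
    An illustrative way to check which preprocessing a downloaded package ships: SentencePiece-based releases typically contain source.spm and target.spm files, while BPE-based releases contain BPE code files instead. The zip name below is hypothetical.

    import zipfile

    # "opus-mt-en-es.zip" is a placeholder for the downloaded model package
    with zipfile.ZipFile("opus-mt-en-es.zip") as zf:
        for name in zf.namelist():
            print(name)  # look for *.spm (SentencePiece) vs. BPE code files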
  • Are posted test sets preprocessed?

    In the posted test sets, like https://object.pouta.csc.fi/OPUS-MT-models/jap-en/opus-2020-01-09.test.txt,

    • has the source been run through preprocess.sh?
    • have the system translations and gold been run through postprocess.sh (assuming yes, given lack of _)?
    opened by sshleifer 4
  • Corpus clean up and normalization

    (This is a question, please redirect me if this is not the right place to ask)

    I observed that the test and training datasets could be greatly improved with an automatic, language-specific cleanup and normalization step. For example, consider this MT output for en-ml: "എന് റെ വീട് ഇന്ത്യയിലാണ്." Here, the space inside the first word is unwanted. This is a known issue in most of the Malayalam content found on the web, and I found these kinds of issues in the training and testing data.

    If I want to fix this, where exactly do I need to add cleanup code?

    opened by santhoshtr 4
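
    A generic illustration of such a cleanup pass (this is not the pipeline's own cleanup recipe, and it does not fix the Malayalam-specific spacing issue, which needs its own rules): a line-by-line normalization filter that could be run over the plain-text corpus before training.

    import sys
    import unicodedata

    # Minimal normalization filter: Unicode NFC plus whitespace tidying,
    # applied line by line to a plain-text corpus read from stdin.
    for line in sys.stdin:
        line = unicodedata.normalize("NFC", line.rstrip("\n"))
        print(" ".join(line.split()))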
  • Help: create my own model from scratch or fine-tune a pre-trained model?

    My goal is to create a Finnish-language chatbot based on a seq2seq model.

    Can you give me some hints to get started? Should I create my own model from scratch or fine-tune a pre-trained model? The https://huggingface.co/Helsinki-NLP/opus-mt-fi-fi model is particularly interesting. Maybe it is possible to fine-tune it using Finnish chat pairs:

    Kuka sei? /t Mina Alex ................................

    Any hints appreciated. Maybe a ready-made project already exists?

    For now I use: https://medium.com/axel-springer-tech/headliner-easy-training-and-deployment-of-seq2seq-models-2a26508b4dae https://github.com/as-ideas/headliner

    Thanks.

    opened by remotejob 3
  • How to download training data?

    It seems like make data is looking for /projappl/nlpl/data/OPUS/*/latest/xml/en-ro.xml.gz. I can fix the path, but I think I will still need to download en-ro.xml.gz. Could you provide instructions for how to do that?

    I found the opustools command opus_express -s en -t ro; is that the data the models were trained on?

    opened by sshleifer 3
  • Unable to find current origin/master revision in submodule path

    When I run the install I get the following errors:

    Unable to find current origin/master revision in submodule path 'third_party/cpuinfo'
    Failed to recurse into submodule path 'tools/marian-dev/src/3rd_party/fbgemm'
    Failed to recurse into submodule path 'tools/browsermt/marian-dev'
    Failed to recurse into submodule path 'tools/marian-dev'

    I found that the cpuinfo and fbgemm repositories changed their default branch from master to main, so we need to change the config in the file tools/marian-dev/src/3rd_party/fbgemm/.gitmodules. For example, I added the attribute branch = main:

    [submodule "third_party/cpuinfo"]
    	path = third_party/cpuinfo
    	url = https://github.com/pytorch/cpuinfo
    	branch = main
    
    [submodule "third_party/googletest"]
    	path = third_party/googletest
    	url = https://github.com/google/googletest
    	branch = main
    
    opened by hthanhbmt 0
  • different sizes of dictionaries in different models

    Hi, I use different tokenizers for different languages:

    Helsinki-NLP/opus-mt-en-de
    Helsinki-NLP/opus-mt-en-he
    Helsinki-NLP/opus-mt-en-ru
    Helsinki-NLP/opus-mt-en-es

    I see that the English parts of the dictionaries are different. For example, tokenizer_he.tokenize("housekeeper") outputs ['▁housekeeper'] and tokenizer_es.tokenize("housekeeper") outputs ['▁house', 'keeper'].

    I want to know what the reason for this difference is. Were the models trained on different datasets? Thank you, Bar

    opened by bariluz93 1
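
    An illustrative comparison for the question above: the bilingual checkpoints each ship their own subword model, trained on that pair's data, which would explain both the different vocabulary sizes and the different segmentations of the same English word.

    from transformers import MarianTokenizer

    tok_he = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-he")
    tok_es = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

    print(len(tok_he), len(tok_es))        # different vocabulary sizes
    print(tok_he.tokenize("housekeeper"))  # e.g. ['▁housekeeper']
    print(tok_es.tokenize("housekeeper"))  # e.g. ['▁house', 'keeper']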
  • update Dockerfile.gpu--fixed

    1. Dockerfile.gpu uses the nvidia/cuda:9.0-devel base image, whose repository GPG key is no longer valid. Reference: https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/.

    2. Solution: update the header from

    FROM nvidia/cuda:9.0-devel
    ENV LANG=C.UTF-8
    RUN apt update && \
        apt upgrade -y && \
        apt install -y ruby wget git cmake g++ libboost-all-dev \
                       doxygen graphviz libblas-dev libopenblas-dev \
    		   libz-dev libssl-dev zlib1g-dev libbz2-dev liblzma-dev \
    		   libprotobuf9v5 protobuf-compiler libprotobuf-dev \
    		   python3-dev python3-numpy python3-setuptools \
    		   cython3
    

    To the following:

    FROM nvidia/cuda:9.0-devel
    ENV LANG=C.UTF-8
    
    RUN rm -rf /etc/apt/sources.list.d/*
    RUN apt update && apt install gnupg-curl
    RUN apt-key del 7fa2af80
    RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
    RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/7fa2af80.pub
    
    RUN echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
        echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
    RUN apt update && apt upgrade -y && apt install -y ruby wget git cmake g++ libboost-all-dev \
                       doxygen graphviz libblas-dev libopenblas-dev \
    		   libz-dev libssl-dev zlib1g-dev libbz2-dev liblzma-dev \
    		   libprotobuf9v5 protobuf-compiler libprotobuf-dev \
    		   python3-dev python3-numpy python3-setuptools \
    		   cython3
    
    

    After this modification, the GPU Dockerfile builds successfully.

    opened by gotomypc 0
  • Using OPUS-MT with DeepSpeed

    Hello,

    I am trying to use OPUS-MT together with DeepSpeed compression (examples can be found at https://github.com/microsoft/DeepSpeedExamples under model_compression).

    I am running into an issue where the exact same code works if I use t5-small, but if I switch to Helsinki-NLP/opus-mt-zh-en it does not work anymore. The error is:

    Traceback (most recent call last):
      File "translation/run_translation.py", line 686, in <module>
        main()
      File "translation/run_translation.py", line 603, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/trainer.py", line 1504, in train
        ignore_keys_for_eval=ignore_keys_for_eval,
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/trainer.py", line 1742, in _inner_training_loop
        tr_loss_step = self.training_step(model, inputs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/trainer.py", line 2486, in training_step
        loss = self.compute_loss(model, inputs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/trainer.py", line 2518, in compute_loss
        outputs = model(**inputs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
        output.reraise()
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
        raise exception
    TypeError: Caught TypeError in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
        output = module(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/models/marian/modeling_marian.py", line 1455, in forward
        return_dict=return_dict,
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/models/marian/modeling_marian.py", line 1229, in forward
        return_dict=return_dict,
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/models/marian/modeling_marian.py", line 751, in forward
        embed_pos = self.embed_positions(input_shape)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/deepspeed/compression/basic_layer.py", line 130, in forward
        self.sparse)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/functional.py", line 2199, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not torch.Size
    

    Has anyone ever encountered this issue?

    opened by rlenain 0
  • Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model

    The translation result from English to Korean using the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model does not make sense at all:

    from transformers import MarianMTModel, MarianTokenizer

    # MODEL_PATH3 was not defined in the original snippet; the checkpoint named in the issue title
    MODEL_PATH3 = "Helsinki-NLP/opus-mt-tc-big-en-ko"

    src_text = [
        "2, 4, 6 etc. are even numbers.",
        "Yes."
    ]
    
    tokenizer = MarianTokenizer.from_pretrained(MODEL_PATH3)
    model = MarianMTModel.from_pretrained(MODEL_PATH3)
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    
    for t in translated:
        print( tokenizer.decode(t, skip_special_tokens=True) )
    

    The result is not ['2, 4, 6 등은 짝수입니다.', '그래'] as in the example, but ['그들은,우리는,우리는 모자입니다. 신뢰할 수 있습니다.', 'ATP입니다.'] which does not make sense at all.

    I tried some more sentences and believe that the correct tokenizer or vocab file would fix this problem. Could you take a look at it?

    opened by regpath 0