Training open neural machine translation models

Overview

Train Opus-MT models

This package includes scripts for training NMT models using MarianNMT and OPUS data for OPUS-MT. More details are given in the Makefile, but the documentation still needs to be improved. Note also that the targets require a specific environment and currently work well only on the CSC HPC cluster in Finland.

Pre-trained models

The subdirectory models contains information about pre-trained models that can be downloaded from this project. They are distributed under a CC-BY 4.0 license. More pre-trained models trained with the OPUS-MT training pipeline are available from the Tatoeba translation challenge, also under a CC-BY 4.0 license.
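
The released models can also be loaded directly through the Hugging Face transformers library, where converted checkpoints are published under the Helsinki-NLP organization. A minimal sketch, assuming the Helsinki-NLP/opus-mt-fi-en checkpoint (any released pair works the same way):

from transformers import MarianMTModel, MarianTokenizer

# Load a converted OPUS-MT checkpoint from the Hugging Face hub
model_name = "Helsinki-NLP/opus-mt-fi-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate a Finnish sentence into English
batch = tokenizer(["Tämä on testi."], return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))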

Quickstart

Setting up:

git clone https://github.com/Helsinki-NLP/OPUS-MT-train.git
git submodule update --init --recursive --remote
make install

Training a multilingual NMT model (Finnish and Estonian to Danish, Swedish and English):

make SRCLANGS="fi et" TRGLANGS="da sv en" train
make SRCLANGS="fi et" TRGLANGS="da sv en" eval
make SRCLANGS="fi et" TRGLANGS="da sv en" release

More information is available in the documentation linked below.

Documentation

Tutorials

References

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
 }

Acknowledgements

None of this would be possible without all the great open source software including

... and many other tools like terashuf, pigz, jq, Moses SMT, fast_align, sacrebleu ...

We would also like to acknowledge the support of the University of Helsinki, CSC (the Finnish IT Center for Science), funding through projects in the EU Horizon 2020 framework (FoTran, MeMAD, ELG), and the contributors to OPUS, the open collection of parallel corpora.

Comments
  • Word Alignment Files

    Hi, I'm retraining the existing id-en model with my own training data. To train the model, the Makefile passes the --guided-alignment parameter together with the path to a word-alignment file, but that file is not present in the pre-trained models. Can you share that file?

    Thanks.

    opened by katphlab 7
  • How to get vocab.yml file when doing train->eval->dist

    Hey,

    First of all I wanted to thank you for this amazing project.

    I followed the instructions in the repo, set up the environment correctly, and can run train and eval as instructed without any problems. Release does not work for me, so I tried dist instead, which did work and packaged the model into a zip. However, the Hugging Face script for converting Marian models to PyTorch requires a vocab.yml file, which is present in all the pre-trained OPUS-MT models but not in my zip file - I only have src.vocab and trg.vocab files.

    Could you please explain to me how to get the vocab.yml file, and whether it is done using any make commands or manually?

    Thanks, Best, Oren

    opened by orendar 6
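
    A rough, hedged sketch for the question above (this is not the project's own release script): assuming src.vocab and trg.vocab are SentencePiece vocabulary listings with one token<TAB>score entry per line, a joint vocab.yml-style token-to-id map can be assembled as follows; verify the file format and the resulting mapping against a released model before relying on it.

    import yaml  # PyYAML

    def read_tokens(path):
        # Assumption: one "token<TAB>score" entry per line (SentencePiece .vocab format)
        with open(path, encoding="utf-8") as fh:
            return [line.rstrip("\n").split("\t")[0] for line in fh if line.strip()]

    joint = {}
    for token in read_tokens("src.vocab") + read_tokens("trg.vocab"):
        joint.setdefault(token, len(joint))

    with open("vocab.yml", "w", encoding="utf-8") as fh:
        yaml.safe_dump(joint, fh, allow_unicode=True, default_flow_style=False)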
  • [Language Codes] How are models named?

    For example, in cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de, is there a table or some other source for what zh_HK, zh_yue, yue, etc. represent? Is zh_yue different from yue? Is zh_cn different from cn somehow?

    Thanks in advance!

    opened by sshleifer 6
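
    This does not decode the zh_* variants themselves, but as an illustrative check: models with multiple target languages store the accepted target codes as >>code<< tokens in their vocabulary, so they can be listed from a converted checkpoint. A sketch, using the Helsinki-NLP/opus-mt-en-ROMANCE checkpoint purely as an example:

    from transformers import MarianTokenizer

    # Single-target models typically carry no >>code<< tokens at all
    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-ROMANCE")
    lang_tokens = sorted(t for t in tokenizer.get_vocab()
                         if t.startswith(">>") and t.endswith("<<"))
    print(lang_tokens)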
  • Problem Fine-tuning Models using TMX files

    Hi! I have been working with the OPUS-MT models available through Hugging Face and was hoping to fine-tune them using TMX files. At this point, I have been able to install the listed prerequisites and follow the documentation, but the tmx-tune recipe is failing around the ${TUNED_MODEL}.done target starting around line 520. The LOADMODS environment variable is empty, which is causing the failure. The repo contains no documentation mentioning the value of LOADMODS and it is not set anywhere, so I am lost as to what it should be.

    More information (not sure if this is related/helpful to answer the question):

    • I have created a Docker container to set up this environment using a CUDA-based Ubuntu image; currently I am not using a GPU for testing, but I do have access to one if needed
    • To get to this point, I have had to move files/directories around (i.e. moving the tools directory from OPUS-MT-train to OPUS-MT-train/finetune and running the Makefile from that directory)
    • I have tried arbitrarily setting LOADMODS in the make tmx-tune command and the recipe still fails on the ${TUNED_MODEL}.done target because ${TRAIN_SRC}.pre.gz is empty
    opened by hdeval1 5
  • Chinese-English model?

    Thank you for this great resource, it's a really impressive collection of models.

    I've noticed there are models for zh->fi, zh->sv and zh->de, but no model for zh->en (or en->zh, for that matter). Since these are quite prominent language pairs, I'm wondering whether there are plans to add them in the future? Or am I just looking in the wrong place?

    opened by ales-t 5
  • Using pretrained models for translations

    Hello, I have a question regarding the use of the released pre-trained models.

    I have a marian server running the opus en-fr model (BPE). I'd like to test the model by translating some sentences of my choice.

    According to the model documentation, I have to send preprocessed input to the server. The usage of the file preprocess.sh is:

    USAGE preprocess.sh langid bpecodes < input > output

    While langid, input, and output are clear to me, I don't understand what I should pass as bpecodes. Can you please point me in the right direction?

    Thanks in advance

    opened by MickHardins 4
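
    For the question above: bpecodes should be the BPE merge-codes file shipped in the BPE-based model package. As an illustration only (not the project's preprocess.sh itself), a hedged sketch applying such codes with the subword-nmt library; the file name source.bpe is hypothetical, so check the downloaded package for the actual name.

    import codecs
    from subword_nmt.apply_bpe import BPE

    # "source.bpe" is a placeholder for the BPE codes file found in the model package
    with codecs.open("source.bpe", encoding="utf-8") as codes:
        bpe = BPE(codes)

    line = "this is a tokenized , lowercased sentence"
    print(bpe.process_line(line))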
  • en-es / es-en : spm instead of bpe?

    Hi,

    Do you have SentencePiece (spm) versions of the tokenization for the es-en / en-es models, since source and target spm models are required to convert the models into PyTorch?

    Thank you.

    opened by pentegroom 4
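
    An illustrative way to check which preprocessing a downloaded package ships: SentencePiece-based releases typically contain source.spm and target.spm files, while BPE-based releases contain BPE code files instead. The zip name below is hypothetical.

    import zipfile

    # "opus-mt-en-es.zip" is a placeholder for the downloaded model package
    with zipfile.ZipFile("opus-mt-en-es.zip") as zf:
        for name in zf.namelist():
            print(name)  # look for *.spm (SentencePiece) vs. BPE code files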
  • Are posted test sets preprocessed?

    In the posted test sets, like https://object.pouta.csc.fi/OPUS-MT-models/jap-en/opus-2020-01-09.test.txt,

    • has the source been run through preprocess.sh?
    • have the system translations and gold been run through postprocess.sh (assuming yes, given lack of _)?
    opened by sshleifer 4
  • Corpus clean up and normalization

    (This is a question, please redirect me if this is not the right place to ask)

    I observed that the test and training datasets could be greatly improved with an automatic, language-specific cleanup and normalization step. For example, consider this MT output for en-ml: "എന് റെ വീട് ഇന്ത്യയിലാണ്." Here, the space inside the first word is unwanted. This is a known issue in most of the Malayalam content found on the web, and I found these kinds of issues in the training and testing data.

    If I want to fix this, where exactly do I need to add cleanup code?

    opened by santhoshtr 4
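
    A generic illustration of such a cleanup pass (this is not the pipeline's own cleanup recipe, and it does not fix the Malayalam-specific spacing issue, which needs its own rules): a line-by-line normalization filter that could be run over the plain-text corpus before training.

    import sys
    import unicodedata

    # Minimal normalization filter: Unicode NFC plus whitespace tidying,
    # applied line by line to a plain-text corpus read from stdin.
    for line in sys.stdin:
        line = unicodedata.normalize("NFC", line.rstrip("\n"))
        print(" ".join(line.split()))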
  • Help: create my own model from scratch or fine-tune a pre-trained model?

    My goal is to create a Finnish-language chatbot based on a seq2seq model.

    Can you give me some hints to get started? Should I create my own model from scratch or fine-tune a pre-trained model? The https://huggingface.co/Helsinki-NLP/opus-mt-fi-fi model is particularly interesting. Maybe it is possible to fine-tune it using Finnish chat pairs:

    Kuka sei? /t Mina Alex ................................

    Any hints appreciated. Maybe a ready-made project already exists?

    For now I use: https://medium.com/axel-springer-tech/headliner-easy-training-and-deployment-of-seq2seq-models-2a26508b4dae https://github.com/as-ideas/headliner

    Thanks.

    opened by remotejob 3
  • How to download training data?

    It seems like make data is looking for /projappl/nlpl/data/OPUS/*/latest/xml/en-ro.xml.gz. I can fix the path, but I think I will still need to download en-ro.xml.gz. Could you provide instructions for how to do that?

    I found the opustools command opus_express -s en -t ro; is that the data the models were trained on?

    opened by sshleifer 3
  • Unable to find current origin/master revision in submodule path

    When I run the install I get the following errors:

    Unable to find current origin/master revision in submodule path 'third_party/cpuinfo'
    Failed to recurse into submodule path 'tools/marian-dev/src/3rd_party/fbgemm'
    Failed to recurse into submodule path 'tools/browsermt/marian-dev'
    Failed to recurse into submodule path 'tools/marian-dev'

    I found that the cpuinfo and fbgemm repositories changed their default branch from master to main, so we need to change the config in the file tools/marian-dev/src/3rd_party/fbgemm/.gitmodules. For example, I added the attribute branch = main:

    [submodule "third_party/cpuinfo"]
    	path = third_party/cpuinfo
    	url = https://github.com/pytorch/cpuinfo
    	branch = main
    
    [submodule "third_party/googletest"]
    	path = third_party/googletest
    	url = https://github.com/google/googletest
    	branch = main
    
    opened by hthanhbmt 0
  • different sizes of dictionaries in different models

    Hi, I use different tokenizers for different languages:

    Helsinki-NLP/opus-mt-en-de
    Helsinki-NLP/opus-mt-en-he
    Helsinki-NLP/opus-mt-en-ru
    Helsinki-NLP/opus-mt-en-es

    I see that the English parts of the dictionaries are different. For example, tokenizer_he.tokenize("housekeeper") outputs ['▁housekeeper'] and tokenizer_es.tokenize("housekeeper") outputs ['▁house', 'keeper'].

    I want to know what the reason for this difference is. Were the models trained on different datasets? Thank you, Bar

    opened by bariluz93 1
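
    An illustrative comparison for the question above: the bilingual checkpoints each ship their own subword model, trained on that pair's data, which would explain both the different vocabulary sizes and the different segmentations of the same English word.

    from transformers import MarianTokenizer

    tok_he = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-he")
    tok_es = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

    print(len(tok_he), len(tok_es))        # different vocabulary sizes
    print(tok_he.tokenize("housekeeper"))  # e.g. ['▁housekeeper']
    print(tok_es.tokenize("housekeeper"))  # e.g. ['▁house', 'keeper']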
  • update Dockerfile.gpu--fixed

    1. Dockerfile.gpu uses the nvidia/cuda:9.0-devel base image, whose repository GPG key is no longer valid. Reference: https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/.

    2. Solution: update the header from

    FROM nvidia/cuda:9.0-devel
    ENV LANG=C.UTF-8
    RUN apt update && \
        apt upgrade -y && \
        apt install -y ruby wget git cmake g++ libboost-all-dev \
                       doxygen graphviz libblas-dev libopenblas-dev \
    		   libz-dev libssl-dev zlib1g-dev libbz2-dev liblzma-dev \
    		   libprotobuf9v5 protobuf-compiler libprotobuf-dev \
    		   python3-dev python3-numpy python3-setuptools \
    		   cython3
    

    To the following:

    FROM nvidia/cuda:9.0-devel
    ENV LANG=C.UTF-8
    
    RUN rm -rf /etc/apt/sources.list.d/*
    RUN apt update && apt install gnupg-curl
    RUN apt-key del 7fa2af80
    RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
    RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/7fa2af80.pub
    
    RUN echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
        echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
    RUN apt update && apt upgrade -y && apt install -y ruby wget git cmake g++ libboost-all-dev \
                       doxygen graphviz libblas-dev libopenblas-dev \
    		   libz-dev libssl-dev zlib1g-dev libbz2-dev liblzma-dev \
    		   libprotobuf9v5 protobuf-compiler libprotobuf-dev \
    		   python3-dev python3-numpy python3-setuptools \
    		   cython3
    
    

    After this modification, the GPU Dockerfile builds successfully.

    opened by gotomypc 0
  • Using OPUS-MT with DeepSpeed

    Hello,

    I am trying to use OPUS-MT together with DeepSpeed compression (examples can be found at https://github.com/microsoft/DeepSpeedExamples under model_compression).

    I am running into an issue where the exact same code works if I use t5-small, but if I switch to Helsinki-NLP/opus-mt-zh-en it does not work anymore. The error is:

    Traceback (most recent call last):
      File "translation/run_translation.py", line 686, in <module>
        main()
      File "translation/run_translation.py", line 603, in main
        train_result = trainer.train(resume_from_checkpoint=checkpoint)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/trainer.py", line 1504, in train
        ignore_keys_for_eval=ignore_keys_for_eval,
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/trainer.py", line 1742, in _inner_training_loop
        tr_loss_step = self.training_step(model, inputs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/trainer.py", line 2486, in training_step
        loss = self.compute_loss(model, inputs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/trainer.py", line 2518, in compute_loss
        outputs = model(**inputs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
        output.reraise()
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/_utils.py", line 461, in reraise
        raise exception
    TypeError: Caught TypeError in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
        output = module(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/models/marian/modeling_marian.py", line 1455, in forward
        return_dict=return_dict,
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/models/marian/modeling_marian.py", line 1229, in forward
        return_dict=return_dict,
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/transformers/models/marian/modeling_marian.py", line 751, in forward
        embed_pos = self.embed_positions(input_shape)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/deepspeed/compression/basic_layer.py", line 130, in forward
        self.sparse)
      File "/home/CORP/r.lenain/miniconda3/envs/mt_opus-mt/lib/python3.7/site-packages/torch/nn/functional.py", line 2199, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not torch.Size
    

    Has anyone ever encountered this issue?

    opened by rlenain 0
  • Wrong tokenizer/vocab for the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model

    The translation result from English to Korean using the 'Helsinki-NLP/opus-mt-tc-big-en-ko' model does not make sense at all:

    from transformers import MarianMTModel, MarianTokenizer

    # MODEL_PATH3 was not defined in the original snippet; the checkpoint named in the issue title
    MODEL_PATH3 = "Helsinki-NLP/opus-mt-tc-big-en-ko"

    src_text = [
        "2, 4, 6 etc. are even numbers.",
        "Yes."
    ]
    
    tokenizer = MarianTokenizer.from_pretrained(MODEL_PATH3)
    model = MarianMTModel.from_pretrained(MODEL_PATH3)
    translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
    
    for t in translated:
        print( tokenizer.decode(t, skip_special_tokens=True) )
    

    The result is not ['2, 4, 6 등은 짝수입니다.', '그래'] as in the example, but ['그들은,우리는,우리는 모자입니다. 신뢰할 수 있습니다.', 'ATP입니다.'] which does not make sense at all.

    I tried some more sentences and believe that the correct tokenizer or vocab file would fix this problem. Could you take a look at it?

    opened by regpath 0