Yet Another Neural Machine Translation Toolkit

Overview

YANMTT

YANMTT is short for Yet Another Neural Machine Translation Toolkit. For the backstory of how I ended up creating this toolkit, scroll to the bottom of this README. Although the name says that it is yet another toolkit, it was written to give a better understanding of the entire training flow: data pre-processing, sharding, batching, distributed training and decoding. There is a significant emphasis on multilingualism and on cross-lingual learning.

List of features:

  1. Basic NMT pre-training, fine-tuning, decoding
    Distributed training (tested on up to 48 GPUs; we don't have that much money).
    Mixed precision training (currently has optimization issues on multiple GPUs).
    Tempered softmax training, entropy maximization training.
    Joint training using monolingual and parallel corpora.
    MBART pre-training with cross-lingual constraints.
    Sentence representation and attention extraction.
    Scoring translations using trained NMT models (for reranking, filtering or quality estimation).
  2. Multilingual training
    Fine-grained control over checkpoint saving for optimising per-language-pair performance.
  3. Fine-grained parameter transfer
    Remap embeddings and layers between pre-trained and fine-tuned models.
    Eliminate components or layers prior to decoding or fine-tuning.
  4. Model compression
    Training compact models from scratch via recurrently stacked layers (similar to what is used in ALBERT).
    Distillation of pre-trained and fine-tuned models. Distillation styles supported: label cross-entropy, attention cross-entropy, layer similarity.
  5. Simultaneous NMT
    Simulated wait-k NMT, where we train and decode wait-k models or decode full-sentence models using wait-k (see the sketch just after this list).
  6. Multi-source and Document NMT
    Vanilla multi-source NMT with two input sentences belonging to different languages.
    Document level NMT where one input is the current sentence and the other one is the context.
    Can be combined with wait-k NMT.
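
To give a feel for the simulated wait-k idea, here is a minimal, generic sketch (an illustration of the technique, not the toolkit's actual code): at target step t the decoder may attend only to the first k + t source tokens, which can be expressed as a cross-attention mask.

    import torch

    def waitk_cross_attention_mask(src_len: int, tgt_len: int, k: int) -> torch.Tensor:
        # mask[t, s] is True where target step t may attend to source position s,
        # i.e. s < k + t; once k + t exceeds src_len the full sentence is visible.
        steps = torch.arange(tgt_len).unsqueeze(1)       # shape (tgt_len, 1)
        positions = torch.arange(src_len).unsqueeze(0)   # shape (1, src_len)
        return positions < (steps + k)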

Prerequisites (core):
Python v3.6
PyTorch v1.7.1
HuggingFace Transformers v4.3.2 (install the modified copy of the transformers library provided with this toolkit)
tensorflow-gpu v2.3.0
sentencepiece v0.1.95 (you will also need to build it from https://github.com/google/sentencepiece, as the spm_train binary is used later)
gputil v1.4.0
CUDA 10.0/10.1/10.2 (tested on 10.0)

How to install:

  1. Clone the repo and go to the toolkit directory via: "git clone https://github.com/prajdabre/yanmtt && cd yanmtt"
  2. Create a Python 3.6 virtual environment and activate it via: "virtualenv -p /usr/bin/python3.6 py36 && source py36/bin/activate"
  3. Update pip via "pip install pip --upgrade" and then install the required packages via: "pip install -r requirements.txt"
  4. Install the modified version of transformers provided along with this repo by: "cd transformers && python setup.py install"
  5. Modify the "create_autotokenizer.sh" file by specifying the correct path to the sentencepiece trainer ("spm_train") on line 8
  6. Set the python path to the local transformers repo by: PYTHONPATH=$PYTHONPATH:/path/to/this/toolkit/transformers
  7. Make sure that the PATH and LD_LIBRARY_PATH variables point to the appropriate CUDA folders (bin and lib64/lib respectively).
  8. Whenever you do a git pull and the files in the transformers repo have been updated, remember to run "python setup.py install" again to update the compiled python scripts.
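
Putting steps 1-6 together, a typical installation session looks like this (paths are illustrative; adjust them to your system):

    git clone https://github.com/prajdabre/yanmtt && cd yanmtt
    virtualenv -p /usr/bin/python3.6 py36 && source py36/bin/activate
    pip install pip --upgrade
    pip install -r requirements.txt
    cd transformers && python setup.py install && cd ..
    # edit line 8 of create_autotokenizer.sh to point at your spm_train binary
    export PYTHONPATH=$PYTHONPATH:/path/to/this/toolkit/transformers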

Scripts and their functionality:

  1. create_autotokenizer.sh and create_autotokenizer.py: These scripts govern the creation of a unigram SPM or BPE tokenizer. The shell script creates the subword segmenter using sentencepiece, which can produce both SPM and BPE models; all you need is a monolingual corpus for the languages you are interested in. The Python script then wraps the segmenter in an AlbertTokenizer (for SPM) or an MBartTokenizer (for BPE), adds special user-defined tokens, and saves a configuration file for future use via AutoTokenizer.
    Usage: see examples/create_tokenizer.sh
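
    Under the hood, the shell script calls the sentencepiece trainer. A minimal spm_train invocation of this kind might look like the following (the flag values here are illustrative defaults, not the script's exact settings):

      spm_train --input=mono.en,mono.hi --model_prefix=mytokenizer \
        --vocab_size=32000 --model_type=unigram --character_coverage=1.0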

  2. pretrain_nmt.py: This is used to train an MBART model. At the very least you need a monolingual corpus for the languages you are interested in and a tokenizer trained for those languages. This script can also be used to do MBART-style pre-training jointly with regular NMT training, although that NMT training is rather basic because there is no evaluation during training; if you want to do advanced NMT training, use the "train_nmt.py" script instead, and do not use the outcome of this script to perform final translations. Additional advanced usages include simulated wait-k simultaneous NMT, knowledge distillation, and fine-tuning pre-existing MBART models with fine-grained control over what should be initialized or tuned. Read the code and the command-line arguments for a better understanding of the advanced features.
    Usage: see examples/train_mbart_model.sh
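
    As a concrete illustration, the following single-GPU invocation (reported by a user later in this README; it continues pre-training from an official BART checkpoint) shows the general shape of a pre-training command:

      python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained \
        --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large \
        --langs en --mono_src examples/data/train.en --batch_size 8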

  3. train_nmt.py: This is used to either train an NMT model from scratch or fine-tune a pre-existing MBART or NMT model. At the very least you need a parallel corpus (preferably split into train, dev and test sets, although we can make do with only a train set) for the language pairs you are interested in. There are several advanced features, such as simulated wait-k simultaneous NMT, knowledge distillation, fine-grained control over what should be initialized or tuned, document NMT, multi-source NMT and multilingual NMT training.
    Usage: see examples/train_or_fine_tune_model.sh
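
    For intuition about the tempered softmax training mentioned in the feature list, the idea is to divide the logits by a temperature T before the loss; T > 1 flattens the predicted distribution and T < 1 sharpens it. A generic sketch of the technique (not the toolkit's exact code):

      import torch
      import torch.nn.functional as F

      def tempered_cross_entropy(logits: torch.Tensor, labels: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
          # logits: (batch, vocab); labels: (batch,) of class indices.
          return F.cross_entropy(logits / temperature, labels)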

  4. decode_model.py: This is used to decode sentences using a trained model. Additionally, you can do translation-pair scoring, forced decoding, forced alignment (experimental), encoder/decoder representation extraction and alignment visualization; pair scoring is sketched below.
    Usage: see examples/decode_or_probe_model.sh
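
    Translation-pair scoring (for the reranking, filtering and quality-estimation use case above) conceptually amounts to summing the model's log-probabilities of the reference target tokens under forced decoding. A rough sketch with a generic HuggingFace-style seq2seq model (illustrative, not the toolkit's exact code):

      import torch
      import torch.nn.functional as F

      def score_pair(model, tokenizer, source: str, target: str) -> float:
          # Higher (less negative) scores mean the model considers the pair more likely.
          inputs = tokenizer(source, return_tensors="pt")
          labels = tokenizer(target, return_tensors="pt").input_ids
          with torch.no_grad():
              logits = model(**inputs, labels=labels).logits
          log_probs = F.log_softmax(logits, dim=-1)
          return log_probs.gather(-1, labels.unsqueeze(-1)).sum().item()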

  5. common_utils.py: This contains all the housekeeping functions, such as corpora splitting, batch generation and loss computation. Do take a look at all the methods, since you may need to modify them. The masked batch generation is sketched below.
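
    For intuition about the masked batch generation done here, BART-style span masking can be sketched roughly as follows (illustrative defaults only; the code comments quoted in the issue reports further down mention a masking ratio range of (0.30, 0.40) and a Poisson lambda of 3.5, and the mask token name here is hypothetical):

      import random
      import numpy as np

      def mask_spans(tokens, mask_ratio=0.35, poisson_lambda=3.5, mask_token="[MASK]"):
          # Replace word spans, whose lengths are drawn from Poisson(poisson_lambda),
          # with a single mask token until roughly mask_ratio of the tokens are covered.
          tokens = list(tokens)
          budget = int(len(tokens) * mask_ratio)
          masked = 0
          while masked < budget and len(tokens) > 1:
              span = max(1, int(np.random.poisson(poisson_lambda)))
              start = random.randrange(len(tokens))
              span = min(span, len(tokens) - start)
              tokens[start:start + span] = [mask_token]
              masked += span
          return tokens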

  6. average_checkpoints.py: You can average the specified checkpoints using either arithmetic or geometric averaging.
    Usage: see examples/avergage_model_checkpoints.sh
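
    Arithmetic averaging is easy to picture; a minimal sketch (the actual script also supports geometric averaging, and this sketch assumes each checkpoint is a plain state_dict of tensors):

      import torch

      def average_checkpoints(paths):
          # Element-wise arithmetic mean of parameter tensors across checkpoints.
          avg = None
          for path in paths:
              state = torch.load(path, map_location="cpu")
              if avg is None:
                  avg = {k: v.float().clone() for k, v in state.items()}
              else:
                  for k in avg:
                      avg[k] += state[k].float()
          for k in avg:
              avg[k] /= len(paths)
          return avg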

  7. gpu_blocker.py: This is used to temporarily occupy a GPU in case you work in a shared GPU environment. Run it in the background before launching the training processes so that, while the training scripts are busy with preprocessing such as sharding or model loading, the GPU you aim for is not taken by someone else. Usage is shown in the example scripts for training.
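
    The general trick behind occupying a GPU, sketched below, is simply to hold an allocation on the device until the real job is ready (this illustrates the idea only; it is not necessarily how gpu_blocker.py itself is implemented):

      import time
      import torch

      placeholder = torch.empty(256, 1024, 1024, device="cuda")  # ~1 GiB of fp32
      time.sleep(600)  # hold the device while your training run finishes preprocessing
      del placeholder
      torch.cuda.empty_cache()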

Note:

  1. Whenever you run the example usage scripts, run them as examples/scriptname.sh from the root directory of the toolkit.
  2. The data under examples/data is taken from https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ and is part of the ALT Parallel Corpus, released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

License and copyright:

  1. MIT license for the code that I wrote.
  2. Apache license for modifications or additions to the HuggingFace code.

Copyright 2021 National Institute of Information and Communications Technology (Raj Dabre)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contact:
Contact me (Raj Dabre) at [email protected] or [email protected]

Backstory: Why I made this toolkit

Despite the fact that I enjoy coding, I never really pushed myself during my Masters and Ph.D. towards writing a self-contained toolkit. I had always known that coding is an important part of research, and although I had made plenty of meaningful changes to several code bases, I never felt like I owned any of those changes. Fast forward to 2020, when I wanted to play with MBART/BART/MASS. It would have been easy to use fairseq or tensor2tensor, but then again the feeling of lack of ownership would remain. HuggingFace provides a lot of implementations but (at the time) had no actual script to easily do MBART pre-training. All I had was this single comment: https://github.com/huggingface/transformers/issues/5096#issuecomment-645860271. After a bit of hesitation I decided to get my hands dirty and make a quick notebook for MBART pre-training. That snowballed into me writing my own pipeline for data sharding, preprocessing and training. Since I was at it, I wrote a pipeline for fine-tuning. Why not go further and write a pipeline for decoding and analysis? Fine-grained control over fine-tuning? Distillation? Multi-source NMT? Document NMT? Simultaneous wait-k NMT? Three months later I ended up with this toolkit, which I wanted to share with everyone. Since I have worked on low-resource MT and efficient MT, this toolkit mostly contains implementations that somehow involve transfer learning, compression/distillation or simultaneous NMT. I am pretty sure it's not as fast or polished as the ones written by the awesome people at GAFA, but I will be more than happy if a few people use my toolkit.

Issues
  • AttributeError: 'Seq2SeqLMOutput' object has no attribute 'additional_lm_logits'

    After following the installation instructions and running examples/train_mbart_model.sh, I get the error below.

    Loading from checkpoint

    Traceback (most recent call last):
      File "pretrain_nmt.py", line 630, in <module>
        run_demo()
      File "pretrain_nmt.py", line 627, in run_demo
        mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception:

    -- Process 0 terminated with the following error:

    Traceback (most recent call last):
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/root/yanmtt/pretrain_nmt.py", line 359, in model_create_load_run_save
        if mod_compute.additional_lm_logits is not None:
    AttributeError: 'Seq2SeqLMOutput' object has no attribute 'additional_lm_logits'

    What may be going wrong? The version of transformers I have is 4.3.2.

    opened by pruksmhc 6
  • Tokenization issue with pretrained model

    I am trying to further pretrain BART from the huggingface checkpoint with the command below, and it seems like there is a mismatch in the number of arguments passed to _tokenize.

    The command is:

      python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 8

    The error is:

    Using softmax temperature of 1.0
    Masking ratio: 0.3
    Training for: ['en']
    Shuffling corpus!

    Traceback (most recent call last):
      File "pretrain_nmt.py", line 628, in <module>
        run_demo()
      File "pretrain_nmt.py", line 625, in run_demo
        mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception:

    -- Process 0 terminated with the following error:

    Traceback (most recent call last):
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/root/yanmtt/pretrain_nmt.py", line 221, in model_create_load_run_save
        for input_ids, input_masks, decoder_input_ids, labels in generate_batches_monolingual_masked_or_bilingual(tok, args, rank, files, train_files, ctr): #Batches are generated from here. The argument (0.30, 0.40) is a range which indicates the percentage of the source sentence to be masked in case we want masking during training just like we did during BART pretraining. The argument 3.5 is the lambda to the poisson length sampler which indicates the average length of a word sequence that will be masked. Since this is pretraining we do not do any evaluations even if we train on parallel corpora.
      File "/root/yanmtt/common_utils.py", line 482, in generate_batches_monolingual_masked
        iids = tok(lang + " " + masked_sentence + " ", add_special_tokens=False, return_tensors="pt").input_ids
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2377, in __call__
        **kwargs,
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2447, in encode_plus
        **kwargs,
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 441, in _encode_plus
        first_ids = get_input_ids(text)
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 410, in get_input_ids
        tokens = self.tokenize(text, **kwargs)
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 342, in tokenize
        tokenized_text = split_on_tokens(no_split_token, text)
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in split_on_tokens
        for token in tokenized_text
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in <genexpr>
        for token in tokenized_text
    TypeError: _tokenize() takes 2 positional arguments but 5 were given

    Upon some further inspection, it seems like in a commit a few days ago, this line was changed to have 4 arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/tokenization_utils.py#L319

    However, the _tokenize function for the BART tokenizer (which I believe inherits all the way down from GPT2) takes fewer arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/models/gpt2/tokenization_gpt2.py#L241

    opened by pruksmhc 4
  • Continue training on pre-trained BART model

    Hi,

    First, thanks for the work on this repo!

    Now, I have some quite specific requirements for training a BART model, and I have seen several of your comments (on fairseq and/or huggingface) pointing here.

    Before deep-diving into your code, I'm curious how easily I might use it for my needs. I am trying to:

    • use a pre-trained mBART (available on fairseq/HuggingFace; specifically the BARThez model)
    • continue monolingual training with the BART objective, i.e. denoising.

    Most code/script examples are aimed at fine-tuning the model, which usually excludes the denoising part.

    Thanks for any insights!

    opened by pltrdy 1