Learning to Rewrite for Non-Autoregressive Neural Machine Translation

Overview

RewriteNAT

This repo provides the code for reproducing RewriteNAT, proposed in our EMNLP 2021 paper "Learning to Rewrite for Non-Autoregressive Neural Machine Translation". RewriteNAT is an iterative NAT model that uses a locator component to explicitly learn to rewrite the erroneous pieces of a translation during iterative decoding.

Dependencies
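
This README does not pin dependency versions here. As a starting point, the sketch below mirrors the environment users report trying in the comments at the bottom of this page; the Python/PyTorch/CUDA versions are taken from those reports and are an assumption, not an official requirement:

conda create -n RewriteNAT python=3.6
conda activate RewriteNAT
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
cd RewriteNAT
pip install --editable ./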

Preprocessing

All the datasets are tokenized using the scripts from Moses, except for Chinese, which is tokenized with Jieba, and then split into subword units using BPE. The tokenized datasets are binarized using the script binaried.sh as follows:

python preprocess.py \
    --source-lang ${src} --target-lang ${tgt} \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/${dataset} --thresholdtgt 0 --thresholdsrc 0 \
    --workers 64 --joined-dictionary
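
For reference, here is a minimal sketch of the tokenization and BPE step that precedes binarization, assuming the standard Moses tokenizer.perl and the subword-nmt package; the file paths and the number of merge operations are illustrative assumptions, not values fixed by this repo:

# Tokenize with Moses (Jieba replaces this step for Chinese).
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l ${src} \
    < $TEXT/train.raw.${src} > $TEXT/train.tok.${src}

# Learn and apply BPE with subword-nmt; 32k merges is an illustrative choice.
subword-nmt learn-bpe -s 32000 < $TEXT/train.tok.${src} > $TEXT/bpe.codes
subword-nmt apply-bpe -c $TEXT/bpe.codes < $TEXT/train.tok.${src} > $TEXT/train.${src}

Since the binarization step uses --joined-dictionary, the BPE codes would typically be learned on the concatenation of the source and target training data.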

Train

All the models are trained on 8 Tesla V100 GPUs for 300,000 updates with an effective batch size of 128,000 tokens (--max-tokens 4000 × 8 GPUs × --update-freq 4), except for En→Fr, where we train for 500,000 updates to account for the larger dataset. The training script train.rewrite.nat.sh is configured as follows:

python train.py \
    data-bin/${dataset} \
    --source-lang ${src} --target-lang ${tgt} \
    --save-dir ${save_dir} \
    --ddp-backend=no_c10d \
    --task translation_lev \
    --criterion rewrite_nat_loss \
    --arch rewrite_nonautoregressive_transformer \
    --noise full_mask \
    ${share_all_embeddings} \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --length-loss-factor 0.1 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 4000 \
    --save-interval-updates 10000 \
    --max-update ${step} \
    --update-freq 4 \
    --fp16 \
    --save-interval ${save_interval} \
    --discriminator-layers 6 \
    --train-max-iter ${max_iter} \
    --roll-in-g sample \
    --roll-in-d oracle \
    --imitation-g \
    --imitation-d \
    --discriminator-loss-factor ${discriminator_weight} \
    --no-share-discriminator \
    --generator-scale ${generator_scale} \
    --discriminator-scale ${discriminator_scale} \

Evaluation

We evaluate performance with BLEU for all language pairs, except for En→Zh, where we use SacreBLEU. The testing script test.rewrite.nat.sh is used to generate the translations, as follows:

python generate.py \
    data-bin/${dataset} \
    --source-lang ${src} --target-lang ${tgt} \
    --gen-subset ${subset} \
    --task translation_lev \
    --path ${save_dir}/${dataset}/checkpoint_average_${suffix}.pt \
    --iter-decode-max-iter ${max_iter} \
    --iter-decode-with-beam ${beam} \
    --iter-decode-p ${iter_p} \
    --beam 1 --remove-bpe \
    --batch-size 50 \
    --print-step \
    --quiet
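
The --path argument above points to an averaged checkpoint. One way to produce it is fairseq's stock scripts/average_checkpoints.py, assuming this fork keeps that script; the checkpoint count below is an illustrative choice. For En→Zh, the detokenized output can then be scored with the sacrebleu command-line tool:

# Average the last 5 update checkpoints (count is illustrative).
python scripts/average_checkpoints.py \
    --inputs ${save_dir}/${dataset} \
    --num-update-checkpoints 5 \
    --output ${save_dir}/${dataset}/checkpoint_average_${suffix}.pt

# Score detokenized hypotheses against detokenized references (En→Zh).
cat hyp.detok.txt | sacrebleu ref.detok.txt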

Citation

Please cite as:

@inproceedings{geng-etal-2021-learning,
    title = "Learning to Rewrite for Non-Autoregressive Neural Machine Translation",
    author = "Geng, Xinwei and Feng, Xiaocheng and Qin, Bing",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.265",
    pages = "3297--3308",
}

Comments
  • How to install dependencies?

    Hi all,

    Thanks for your awesome work and code!

    I tried to run the code and used the following commands to build the environment:

    conda create -n RewriteNAT python=3.6
    conda activate RewriteNAT
    conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
    cd RewriteNAT
    pip install --editable ./
    

    When I ran pip install --editable ./, I got errors like this:

        ERROR: Command errored out with exit status 1:
         command: /home/azureuser/miniconda3/envs/rewrite/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/azureuser/rewrite/setup.py'"'"'; __file__='"'"'/home/azureuser/rewrite/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
             cwd: /home/azureuser/rewrite/
        Complete output (14 lines):
        running develop
        running egg_info
        creating fairseq.egg-info
        writing fairseq.egg-info/PKG-INFO
        writing dependency_links to fairseq.egg-info/dependency_links.txt
        writing entry points to fairseq.egg-info/entry_points.txt
        writing requirements to fairseq.egg-info/requires.txt
        writing top-level names to fairseq.egg-info/top_level.txt
        writing manifest file 'fairseq.egg-info/SOURCES.txt'
        reading manifest file 'fairseq.egg-info/SOURCES.txt'
        writing manifest file 'fairseq.egg-info/SOURCES.txt'
        running build_ext
        cythoning fairseq/data/data_utils_fast.pyx to fairseq/data/data_utils_fast.cpp
        error: /home/azureuser/rewrite/fairseq/data/data_utils_fast.pyx
        ----------------------------------------
    ERROR: Command errored out with exit status 1: /home/azureuser/miniconda3/envs/rewrite/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/azureuser/rewrite/setup.py'"'"'; __file__='"'"'/home/azureuser/rewrite/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
    

    Could you give me some instructions to build your environment?

    Thanks, hemingkx

    opened by hemingkx 3
  • ModuleNotFoundError: No module named 'fairseq.data.append_token_dataset'

    Hi~ thank you for sharing the code.

    I installed the dependencies with the following commands:

    conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
    cd RewriteNAT
    pip install --editable ./
    

    following Issue 1, but I got this error when I tried to run preprocess.py.

    (rewrite_nat) [wbxu@cu10 RewriteNAT]$ bash preprocess.sh
    Traceback (most recent call last):
      File "preprocess.py", line 7, in <module>
        from fairseq_cli.preprocess import cli_main
      File "/data/wbxu/RewriteNAT/fairseq_cli/preprocess.py", line 18, in <module>
        from fairseq import options, tasks, utils
      File "/data/wbxu/RewriteNAT/fairseq/__init__.py", line 9, in <module>
        import fairseq.criterions  # noqa
      File "/data/wbxu/RewriteNAT/fairseq/criterions/__init__.py", line 10, in <module>
        from fairseq.criterions.fairseq_criterion import FairseqCriterion
      File "/data/wbxu/RewriteNAT/fairseq/criterions/fairseq_criterion.py", line 10, in <module>
        from fairseq import metrics, utils
      File "/data/wbxu/RewriteNAT/fairseq/utils.py", line 21, in <module>
        from fairseq.modules import gelu, gelu_accurate
      File "/data/wbxu/RewriteNAT/fairseq/modules/__init__.py", line 9, in <module>
        from .character_token_embedder import CharacterTokenEmbedder
      File "/data/wbxu/RewriteNAT/fairseq/modules/character_token_embedder.py", line 13, in <module>
        from fairseq.data import Dictionary
      File "/data/wbxu/RewriteNAT/fairseq/data/__init__.py", line 12, in <module>
        from .append_token_dataset import AppendTokenDataset
    ModuleNotFoundError: No module named 'fairseq.data.append_token_dataset'
    

    Could you tell me how to deal with this error? Thank you very much~

    opened by Rexbalaeniceps 2
  • About Performance

    Hi, thanks for your awesome work! When I used the default hyperparameters to train on the WMT14 En-De data, I got:

    • train-max-iter=2, 3 days on 8 × V100, SacreBLEU 18.0 (1 iter), 24.1 (2 iters), 25.5 (5 iters) and 25.6 (10 iters).
    • train-max-iter=4, 6 days on 8 × V100, SacreBLEU 17.0 (1 iter), 24.1 (2 iters), 25.9 (5 iters) and 26.0 (10 iters).

    I wonder if there is something wrong with my training and testing scripts. Here are my scripts:

    Training:

    max_iter=4
    
    src=en
    tgt=de
    
    step=300000
    
    share_all_embeddings="--share-all-embeddings"
    
    save_interval=1
    
    python train.py \
        ${dataset} \
        --source-lang ${src} --target-lang ${tgt} \
        --save-dir /mnt/exp/project \
        --ddp-backend=no_c10d \
        --task translation_lev \
        --criterion rewrite_nat_loss \
        --arch rewrite_nonautoregressive_transformer \
        --noise full_mask \
        ${share_all_embeddings} \
        --optimizer adam --adam-betas '(0.9,0.98)' \
        --lr 0.0005 --lr-scheduler inverse_sqrt \
        --min-lr '1e-09' --warmup-updates 10000 \
        --warmup-init-lr '1e-07' --label-smoothing 0.1 \
        --dropout 0.3 --weight-decay 0.01 \
        --decoder-learned-pos \
        --encoder-learned-pos \
        --length-loss-factor 0.1 \
        --apply-bert-init \
        --log-format 'simple' --log-interval 100 \
        --fixed-validation-seed 7 \
        --max-tokens 4000 \
        --save-interval-updates 10000 \
        --max-update ${step} \
        --update-freq 4 \
        --fp16 \
        --discriminator-layers 6 \
        --train-max-iter ${max_iter} \
        --roll-in-g sample \
        --roll-in-d oracle \
        --imitation-g \
        --imitation-d \
        --no-share-discriminator \
        --reset-optimizer \
        --reset-meters \
        --reset-dataloader \
        --reset-lr-scheduler
    

    Testing:

    src=en
    tgt=de
    
    subset=test
    
    max_iter=1
    
    beam=1
    
    iter_p=0.5
    
    python generate.py \
        ${dataset} \
        --source-lang ${src} --target-lang ${tgt} \
        --gen-subset ${subset} \
        --task translation_lev \
        --criterion rewrite_nat_loss \
        --path ${save_dir} \
        --iter-decode-max-iter ${max_iter} \
        --iter-decode-with-beam ${beam} \
        --iter-decode-p ${iter_p} \
        --beam 1 --remove-bpe \
        --batch-size 25 \
        --print-step
    

    Thanks very much! hemingkx

    opened by hemingkx 1
  • About training

    Hi, when I use the default hyperparameters to train on the IWSLT14 De-En distilled dataset, I get the problem shown in the attached images. I tried train_max_iter as both 2 and 4, but I always hit the same problem. I wonder if I have made an error somewhere, or could you give some advice?

    opened by LitterBrother-Xiao 0
Owner
Xinwei Geng
Ph.D. student working on improving Neural Machine Translation with Reinforcement Learning @HIT-SCIR