Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Overview

This repo contains the implementation of our paper:

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Paper Link

Replication

Python environment

pip install -e .  # run from the DSLP directory
pip install tensorflow tensorboard sacremoses nltk Ninja omegaconf
pip install 'fuzzywuzzy[speedup]'
pip install hydra-core==1.0.6
pip install sacrebleu==1.5.1
pip install git+https://github.com/dugu9sword/lunanlp.git
git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode && pip install .
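
To sanity-check the environment (a minimal check we add here, not part of the original instructions), the core imports should succeed:

python3 -c "import fairseq; print(fairseq.__version__)"   # installed by pip install -e .
python3 -c "import torch; print(torch.cuda.is_available())"
python3 -c "import ctcdecode"                             # only needed for CTC-based models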

Dataset

We downloaded the distilled data from FairSeq.

Preprocess it with:

TEXT=wmt14_ende_distill
python3 fairseq_cli/preprocess.py --source-lang en --target-lang de \
   --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
   --destdir data-bin/wmt14.en-de_kd --workers 40 --joined-dictionary
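
If preprocessing succeeds, data-bin/wmt14.en-de_kd should contain a joined dictionary and binarized splits roughly as follows (file names follow fairseq's conventions; this listing is illustrative):

ls data-bin/wmt14.en-de_kd
# dict.en.txt  dict.de.txt   <- joined dictionary (identical files)
# train.en-de.en.bin  train.en-de.en.idx  train.en-de.de.bin  train.en-de.de.idx
# valid.en-de.*  test.en-de.*  preprocess.log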

Or you can download all the binarized files here.

Hyperparameters

Hyperparameter                 EN<->RO    EN<->DE
--validate-interval-updates    300        500
number of tokens per batch     32K        128K
--dropout                      0.3        0.1
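
The "number of tokens per batch" above is the effective batch size: --max-tokens (per GPU) × number of GPUs × --update-freq. The training commands below use --max-tokens 8192; a sketch of the arithmetic (the GPU counts and --update-freq values here are our assumptions, not taken from the paper):

# EN<->DE: 8192 tokens/GPU x 8 GPUs x --update-freq 2 = 131,072 ~ 128K tokens per batch
# EN<->RO: 8192 tokens/GPU x 4 GPUs x --update-freq 1 =  32,768 ~  32K tokens per batch
# With fewer GPUs, raise --update-freq proportionally to keep the effective batch size.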

Note:

  1. We found that label smoothing is not useful for CTC-based models (at least not with our implementation); we suggest keeping --label-smoothing at 0 for them.
  2. The dropout rate plays a significant role for GLAT, CMLM, and the vanilla NAT. On WMT'14 EN->DE, for example, the vanilla NAT reaches 21.18 BLEU with dropout 0.1, but only 19.68 BLEU with dropout 0.3.

Training

We provide the scripts for replicating the results on the WMT'14 EN->DE task. For other tasks, adapt the binarized data path, --source-lang, --target-lang, and the hyperparameters above accordingly.
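
For example, an EN->RO run would change only these flags relative to the EN->DE commands below (a sketch; the data-bin path is a placeholder, the hyperparameters follow the table above, and "..." stands for the remaining flags, which are identical to the corresponding EN->DE command):

python3 train.py data-bin/wmt16.en-ro_kd --source-lang en --target-lang ro \
   --validate-interval-updates 300 --dropout 0.3 \
   ...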

GLAT with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_glat --criterion glat_loss --arch glat_sd --noise full_mask \
   --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 --glat-mode glat \
   --length-loss-factor 0.1 --pred-length-offset 

CMLM with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch glat_sd --noise full_mask \
   --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 \
   --length-loss-factor 0.1 --pred-length-offset 

Vanilla NAT with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
   --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 \
   --length-loss-factor 0.1 --pred-length-offset 

Vanilla NAT with DSLP and Mixed Training

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
   --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192  --ss-ratio 0.3 --fixed-ss-ratio --masked-loss \
   --length-loss-factor 0.1 --pred-length-offset 

CTC with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.0 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 

CTC with DSLP and Mixed Training

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.0 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 --ss-ratio 0.3 --fixed-ss-ratio

Evaluation

Average the 5 best checkpoints with scripts/average_checkpoints.py. Our results are based on either the best single checkpoint or the averaged checkpoint, whichever scores higher in validation-set BLEU.
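
For example, with fairseq's scripts/average_checkpoints.py (the checkpoint glob assumes fairseq's checkpoint.best_bleu_*.pt naming under --keep-best-checkpoints; adjust the paths to your run):

python3 scripts/average_checkpoints.py \
    --inputs checkpoints/checkpoint.best_bleu_*.pt \
    --output checkpoints/checkpoint.avg_best.pt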

fairseq-generate data-bin/wmt14.en-de_kd  --path PATH_TO_A_CHECKPOINT \
    --gen-subset test --task translation_lev --iter-decode-max-iter 0 \
    --iter-decode-eos-penalty 0 --beam 1 --remove-bpe --print-step --batch-size 100

Note: 1) Add --plain-ctc --model-overrides '{"ctc_beam_size": 1, "plain_ctc": True}' if the model is CTC-based; 2) change the task to translation_glat if the model is GLAT-based.
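
For instance, evaluating a CTC-based checkpoint combines the command above with the overrides from the note:

fairseq-generate data-bin/wmt14.en-de_kd  --path PATH_TO_A_CHECKPOINT \
    --gen-subset test --task translation_lev --iter-decode-max-iter 0 \
    --iter-decode-eos-penalty 0 --beam 1 --remove-bpe --print-step --batch-size 100 \
    --plain-ctc --model-overrides '{"ctc_beam_size": 1, "plain_ctc": True}'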

Output

In addition, we provide the outputs of CTC w/ DSLP, CTC w/ DSLP & Mixed Training, Vanilla NAT w/ DSLP, Vanilla NAT w/ DSLP & Mixed Training, GLAT w/ DSLP, and CMLM w/ DSLP for review purposes.

Model Reference Hypothesis
CTC w/ DSLP ref hyp
CTC w/ DSLP & Mixed Training ref hyp
Vanilla NAT w/ DSLP ref hyp
Vanilla NAT w/ DSLP & Mixed Training ref hyp
GLAT w/ DSLP ref hyp
CMLM w/ DSLP ref hyp

Note: The output is on WMT'14 EN-DE. The references are paired with hypotheses for each model.

Training Efficiency

We show the training efficiency of our DSLP model, built on the vanilla NAT model. Specifically, we compare the BLEU scores of the vanilla NAT and the vanilla NAT with DSLP & Mixed Training at the same training time (in hours).

As shown, our DSLP model achieves much higher BLEU scores shortly after training starts (~3 hours). This demonstrates that DSLP is much more efficient to train, as our model achieves higher BLEU scores with the same amount of training cost.

[Figure: BLEU vs. training time (hours) for vanilla NAT and vanilla NAT w/ DSLP & Mixed Training]

We ran the experiments with 8 Tesla V100 GPUs. The batch size is 128K tokens, and each model is trained for 300K updates.

Comments
  • Reproducing Vanilla NAT Baseline

    Hi all,

    thanks for sharing your code!

    I would like to be able to reproduce the Vanilla NAT Baseline (21.18 BLEU on WMT'14 EN-DE). What is the corresponding command?

    Btw, the command for Vanilla NAT with DSLP

    python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de --save-dir checkpoints --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5 --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000 --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1 --concat-yhat --concat-dropout 0.0 --label-smoothing 0.1 \
       --activation-fn gelu --dropout 0.1 --max-tokens 8192

    does not work. I get: train.py: error: unrecognized arguments: --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1

    Cheers, Stephan

    opened by stephanpeitz 8
  • Does your implementation for CTC + GLAT work?

    I've looked through the code for nat_ctc_glat.py and was wondering how the alignment works for the glancing sampling. For the normal CTC without GLAT this is handled by F.ctc_loss but it seems it's not so straightforward for the GLAT part. I tried to code it up following some of the implementation here as well as in the GLAT repository.

    For me, it fails at the check pred_tokens == tgt_tokens in the GLAT part, which makes sense, as pred_tokens will have the length of the upsampled source from CTC while tgt_tokens are most likely shorter.

    Not sure if it fails using your exact code as well but it would make sense to me, what did you change to make this work?

    opened by SirRob1997 7
  • Generate test with beam=1: BLEU4 = 15.70, 50.9/22.2/11.0/5.7 (BP=0.961, ratio=0.962, syslen=62024, reflen=64481)

    First of all, thanks for your code. I tried to reproduce the result of CTC with DSLP and Mixed Training, but I got the following BLEU:

    Generate test with beam=1: BLEU4 = 15.70, 50.9/22.2/11.0/5.7 (BP=0.961, ratio=0.962, syslen=62024, reflen=64481)
    

    My scripts are the following:

    TEXT=wmt14_ende_distill
    python3 fairseq_cli/preprocess.py --source-lang en --target-lang de \
       --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
       --destdir data-bin/wmt14.en-de_kd --workers 40 --joined-dictionary
    
    python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir ori_checkpoints  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0 \
       --activation-fn gelu --dropout 0.1  --max-tokens 4000 --batch-size 32 --ss-ratio 0.3 --fixed-ss-ratio
    
    fairseq-generate data-bin/wmt14.en-de_kd  --path ori_checkpoints/checkpoint_best.pt \
        --gen-subset test --task translation_lev --iter-decode-max-iter 0 \
        --iter-decode-eos-penalty 0 --beam 1 --remove-bpe --print-step --batch-size 50 \
        --plain-ctc --model-overrides '{"ctc_beam_size": 1, "plain_ctc": True}'
    

    Because I used an RTX 3090 GPU, I had to change the batch size and max tokens parameters.

    Please tell me how to reproduce your results. Many thanks!

    opened by Rexbalaeniceps 5
  • Training time cost per epoch in GLAT with DSLP

    Hi all,

    Thanks very much for your awesome code!

    I noticed there are some differences between your GLAT implementation and the repo here. I tried both and found that the training time per epoch increased rapidly during training (epoch 1 took 10 min, but epoch 50 took 120 min). I wonder if you have encountered this in your experiments and what causes it.

    Thanks very much! hemingkx

    opened by hemingkx 3
  • How to install dependencies and run?

    I first ran pip install --editable . and then ran your training script. The error was ModuleNotFoundError: No module named 'tensorflow'. I found that the tensorflow import in file_io was a hack, so I removed all the related lines. However, it still produces the following error:

      File "/tmp/DSLP/fairseq/criterions/__init__.py", line 18, in <module>
        (
    TypeError: cannot unpack non-iterable NoneType object
    
    opened by zkx06111 3
  • About `--arch glat_sd`

    Hi! I ran into a new problem when I attempted to train a GLAT with DSLP model.

    Following your scripts:

    CUDA_VISIBLE_DEVICES=7 python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir glat_dslp_checkpoints  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_glat --criterion glat_loss --arch glat_sd --noise full_mask \
       --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
       --activation-fn gelu --dropout 0.1  --max-tokens 8192 --glat-mode glat \
       --length-loss-factor 0.1 --pred-length-offset 
    

    I got this error:

    train.py: error: argument --arch/-a: invalid choice: 'glat_sd' (choose from 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de', 'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big', 'transformer_wmt_en_de_big_t2t', 'multilingual_transformer', 'multilingual_transformer_iwslt_de_en', 'transformer_lm', 'transformer_lm_big', 'transformer_lm_baevski_wiki103', 'transformer_lm_wiki103', 'transformer_lm_baevski_gbw', 'transformer_lm_gbw', 'transformer_lm_gpt', 'transformer_lm_gpt2_small', 'transformer_lm_gpt2_medium', 'transformer_lm_gpt2_big', 'lightconv', 'lightconv_iwslt_de_en', 'lightconv_wmt_en_de', 'lightconv_wmt_en_de_big', 'lightconv_wmt_en_fr_big', 'lightconv_wmt_zh_en_big', 'lightconv_lm', 'lightconv_lm_gbw', 'nat', 'nonautoregressive_transformer_wmt_en_de', 'nat_12d', 'nat_24d', 'nacrf_transformer', 'iterative_nonautoregressive_transformer', 'iterative_nonautoregressive_transformer_wmt_en_de', 'cmlm_transformer', 'cmlm_transformer_wmt_en_de', 'levenshtein_transformer', 'levenshtein_transformer_wmt_en_de', 'levenshtein_transformer_vaswani_wmt_en_de_big', 'levenshtein_transformer_wmt_en_de_big', 'insertion_transformer', 'nat_glat', 'glat_base', 'glat_big', 'glat_16e6d', 'nat_sd_shared', 'nat_sd', 'nat_ctc_sd', 'nat_ctc_cross_layer_hidden_replace_deep_sup', 'nat_ctc_sd_12d', 'nat_ctc_sd_de_24d', 'nat_ctc_s', 'nat_ctc_d', 'nat_sd_glat_base', 'nat_sd_glat', 'nat_sd_glat_12d', 'nat_sd_glat_24d', 'nat_sd_glat_12e', 'glat_s', 'glat_d', 'nat_s', 'nat_s_12d', 'nat_s_24d', 'nat_d', 'nat_d_12d', 'nat_d_24d', 'nat_sd_glat_anneal', 'nat_sd_glat_anneal_12d', 'nat_sd_glat_anneal_24d', 'nat_sd_glat_anneal_12e', 'nat_ctc', 'nat_ctc_fixlen', 'nat_ctc_refine', 'ctc_from_zaixiang', 'cmlm_sd', 'nat_cf', 'nat_md', 'nat_sd_ss', 'nat_sd_glat_ss', 'nat_ctc_sd_ss', 'cmlm_sd_ss', 'transformer_align', 'transformer_wmt_en_de_big_align', 'lstm', 'lstm_wiseman_iwslt_de_en', 'lstm_luong_wmt_en_de', 'lstm_lm', 's2t_berard', 's2t_berard_256_3_3', 's2t_berard_512_3_2', 's2t_berard_512_5_3', 's2t_transformer', 's2t_transformer_s', 's2t_transformer_sp', 's2t_transformer_m', 's2t_transformer_mp', 's2t_transformer_l', 's2t_transformer_lp', 'fconv', 'fconv_iwslt_de_en', 'fconv_wmt_en_ro', 'fconv_wmt_en_de', 'fconv_wmt_en_fr', 'roberta', 'roberta_base', 'roberta_large', 'xlm', 'masked_lm', 'bert_base', 'bert_large', 'xlm_base', 'wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'fconv_self_att', 'fconv_self_att_wp', 'fconv_lm', 'fconv_lm_dauphin_wikitext103', 'fconv_lm_dauphin_gbw', 'transformer_from_pretrained_xlm', 'hf_gpt2', 'hf_gpt2_medium', 'hf_gpt2_large', 'hf_gpt2_xl', 'bart_large', 'bart_base', 'mbart_large', 'mbart_base', 'mbart_base_wmt20', 'dummy_model', 'transformer_lm_megatron', 'transformer_lm_megatron_11b', 'transformer_iwslt_de_en_pipeline_parallel', 'transformer_wmt_en_de_big_pipeline_parallel', 'model_parallel_roberta', 'model_parallel_roberta_base', 'model_parallel_roberta_large')
    

    I found that glat_sd doesn't exist in the options. Why is this? By the way, thank you for the previous response; I have achieved ~27 BLEU.

    opened by Rexbalaeniceps 2
  • OOM problem with the model nat_ctc_sd_ss

    I trained the "nat_ctc_sd_ss" model with the command from the README.md on a Tesla V100 GPU, but I got an out-of-memory error. Is there anything that should be changed? My training command:

    python3 train.py $DATA --source-lang en --target-lang de  --save-dir checkpoints/NAT_CTC_DSLP_MT  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.0 \
       --activation-fn gelu --dropout 0.1  --max-tokens 8192 --ss-ratio 0.3 --fixed-ss-ratio
    

    My training log is as follows:

    2022-09-07 20:31:12 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'simple', 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': True, 'suppress_crashes': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:15846', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'no_c10d', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'distributed_wrapper': 'DDP', 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'tpu': False, 'distributed_num_procs': 8}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 8192, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 500, 'validate_after_updates': 0, 'fixed_validation_seed': 7, 'disable_validation': False, 'max_tokens_valid': 8192, 'batch_size_valid': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 300000, 'stop_time_hours': 0.0, 'clip_norm': 2.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': 1e-09, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints/NAT_CTC_DSLP_MT', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 500, 'keep_interval_updates': 5, 'keep_last_epochs': 5, 'keep_best_checkpoints': 5, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'bleu', 'maximize_best_checkpoint_metric': True, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'model_parallel_size': 1, 'distributed_rank': 0}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 
'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False, 'force_no_target': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='nat_ctc_sd_ss', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, 
localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'task': Namespace(_name='translation_lev', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, 
decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, 
threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'criterion': Namespace(_name='nat_loss', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, 
num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 10000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | [en] dictionary: 39842 types
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | [de] dictionary: 39842 types
    2022-09-07 20:31:12 | INFO | fairseq.data.data_utils | loaded 3,000 examples from: ../data/wmt14_ende_distill/bin/valid.en-de.en
    2022-09-07 20:31:12 | INFO | fairseq.data.data_utils | loaded 3,000 examples from: ../data/wmt14_ende_distill/bin/valid.en-de.de
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | ../data/wmt14_ende_distill/bin valid en-de 3000 examples
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | NATransformerModel(
      (encoder): FairseqNATEncoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(39842, 512, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(1026, 512, padding_idx=1)
        (layers): ModuleList(
          (0): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (1): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (2): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (3): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (4): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (5): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (decoder): NATransformerDecoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(39842, 512, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(1026, 512, padding_idx=1)
        (layers): ModuleList(
          (0): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (1): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (2): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (3): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (4): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (5): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
        )
        (output_projection): Linear(in_features=512, out_features=39842, bias=False)
        (embed_length): Embedding(256, 512)
        (reduce_concat): ModuleList(
          (0): Linear(in_features=1024, out_features=512, bias=False)
          (1): Linear(in_features=1024, out_features=512, bias=False)
          (2): Linear(in_features=1024, out_features=512, bias=False)
          (3): Linear(in_features=1024, out_features=512, bias=False)
          (4): Linear(in_features=1024, out_features=512, bias=False)
        )
      )
    )
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | task: TranslationLevenshteinTask
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | model: NATransformerModel
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | criterion: LabelSmoothedDualImitationCriterion
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | num. model params: 68,340,736 (num. trained: 68,340,736)
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.0.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.1.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.2.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.3.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.4.bias
    2022-09-07 20:31:14 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
    2022-09-07 20:31:14 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
    2022-09-07 20:31:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 8 workers***********************
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   0: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   1: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   2: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   3: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   4: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   5: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   6: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   7: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 8 workers***********************
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | max tokens per GPU = 8192 and batch size per GPU = None
    2022-09-07 20:31:14 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/NAT_CTC_DSLP_MT/checkpoint_last.pt
    2022-09-07 20:31:14 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/NAT_CTC_DSLP_MT/checkpoint_last.pt
    2022-09-07 20:31:14 | INFO | fairseq.trainer | loading train data for epoch 1
    2022-09-07 20:31:15 | INFO | fairseq.data.data_utils | loaded 3,961,179 examples from: ../data/wmt14_ende_distill/bin/train.en-de.en
    2022-09-07 20:31:15 | INFO | fairseq.data.data_utils | loaded 3,961,179 examples from: ../data/wmt14_ende_distill/bin/train.en-de.de
    2022-09-07 20:31:15 | INFO | fairseq.tasks.translation | ../data/wmt14_ende_distill/bin train en-de 3961179 examples
    2022-09-07 20:31:17 | INFO | fairseq.tasks.translation_lev | Dataset original size: 3961179, filtered size: 3961117
    2022-09-07 20:31:18 | INFO | fairseq.trainer | begin training epoch 1
    2022-09-07 20:31:23 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 7; 31.75 GiB total capacity; 28.04 GiB already allocated; 1.11 GiB free; 28.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    2022-09-07 20:31:23 | WARNING | fairseq.trainer | |===========================================================================|
    |                  PyTorch CUDA memory summary, device ID 0                 |
    |---------------------------------------------------------------------------|
    |            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
    |===========================================================================|
    |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
    |---------------------------------------------------------------------------|
    | Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Allocations           |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Active allocs         |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | GPU reserved segments |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Non-releasable allocs |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize allocations  |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize GPU segments |       0    |       0    |       0    |       0    |
    |===========================================================================|
    
    opened by YudiZh 0
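    If you hit this OOM on similar hardware, a common mitigation (not specific to this repo) is to shrink the per-GPU token budget and recover the effective batch size with gradient accumulation via fairseq's stock --update-freq flag. A minimal sketch; the values are illustrative, and the remaining flags should be carried over unchanged from the training script above:

        # Halving --max-tokens while doubling --update-freq keeps the effective
        # tokens-per-update constant but lowers peak activation memory per GPU.
        python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de \
           --max-tokens 4096 --update-freq 2 \
           # ...plus the remaining flags from the corresponding training script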
  • No glat_sd arch

    No glat_sd arch

    Hi Chenyang, thanks for your great code! I'm trying to reproduce the GLAT+DSLP model. I checked the provided training scripts, but there is no "--arch glat_sd" model registered in the code; should it be "nat_sd_glat"? Also, what do "ss" and "sd" stand for? Does "sd" mean deeply supervised? What about "ss"? Thanks for your answer!

    opened by bbo0924 2
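    Assuming the repo builds on stock fairseq's architecture registry, the registered --arch names can be listed directly; this one-liner is a sketch, not a utility shipped with the repo. Run it from the DSLP root so the repo's model registrations are importable:

        # Lists every registered architecture name; filter for the NAT/DSLP variants.
        python3 -c "from fairseq.models import ARCH_MODEL_REGISTRY; print('\n'.join(sorted(ARCH_MODEL_REGISTRY)))" | grep -i sd

    If glat_sd does not appear in the output, the name the registry does show (e.g., nat_sd_glat) is the one to pass to --arch.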
  • The shape of probs_seq does not match with the shape of the vocabulary Segmentation fault (core dumped)

    The shape of probs_seq does not match with the shape of the vocabulary Segmentation fault (core dumped)

    [/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
    [/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
    [/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
    Segmentation fault (core dumped)

    I have encountered this problem without modifying the original code. May I ask what the cause might be?

    opened by thunder123321 6
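    The failed check compares each step's probability vector against the size of the labels list given to the decoder, so the usual cause is a mismatch between the model's output vocabulary and the vocabulary handed to ctcdecode. A diagnostic sketch; the dictionary path is an assumption based on the preprocessing layout above, not taken from this issue:

        # A fairseq Dictionary adds 4 special symbols (<s>, <pad>, </s>, <unk>) on top of
        # the entries in the dict file; the printed size must equal both the last
        # dimension of probs_seq and the length of the labels list passed to ctcdecode.
        python3 -c "from fairseq.data import Dictionary; print(len(Dictionary.load('data-bin/wmt14.en-de_kd/dict.de.txt')))"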
  • ctcdecode install error

    ctcdecode install error

    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1673, in _run_ninja_build
        env=env)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/subprocess.py", line 512, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "setup.py", line 55, in <module>
        cmdclass={'build_ext': BuildExtension}
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 708, in build_extensions
        build_ext.build_extensions(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
        _build_ext.build_ext.build_extensions(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
        _build_ext.build_extension(self, ext)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
        depends=ext.depends)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 538, in unix_wrap_ninja_compile
        with_cuda=with_cuda)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1359, in _write_ninja_file_and_compile_objects
        error_prefix='Error compiling objects for extension')
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension

    Environment: torch 1.8, CUDA 11.1, gcc 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04), g++ 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04). Is my torch version too high?

    opened by thunder123321 4
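    The CalledProcessError at the bottom only reports that ninja failed; the actual compiler error appears earlier in the build output. A sketch for isolating it (nothing here is specific to this repo):

        # Rebuild with verbose output and capture the log.
        cd ctcdecode
        pip install . -v 2>&1 | tee build.log
        # The first "error:" line in build.log is the real failure, not the
        # RuntimeError raised at the end by torch's cpp_extension.
        grep -n "error:" build.log | head

    gcc/g++ 7.5 is generally new enough for torch 1.8 extensions, so a mismatch between the CUDA toolkit nvcc uses and the one torch was built against is often the more likely culprit than the torch version itself.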
Owner
Chenyang Huang
Stay hungry, stay foolish
A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

Keon Lee 237 Jan 2, 2023
Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS) Yoonhyung Lee, Joongbo Shin, Kyomin Jung Abstract: Although early

LEE YOON HYUNG 147 Dec 5, 2022
The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

VAENAR-TTS This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis". Sa

THUHCSI 138 Oct 28, 2022
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Keon Lee 67 Nov 14, 2022
PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

null 760 Jan 3, 2023
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository containing my implementa

Utkarsh Jain 1 Feb 17, 2022
Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Parv Bhatt 1 Jan 1, 2022
A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. X-Ray supports 18 languages.

WordDumb A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. Languages X-Ray supp

null 172 Dec 29, 2022
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Token Shift GPT Implementation of Token Shift GPT - An autoregressive model that relies solely on shifting along the sequence dimension and feedforwar

Phil Wang 32 Oct 14, 2022
Original implementation of the pooling method introduced in "Speaker embeddings by modeling channel-wise correlations"

Speaker-Embeddings-Correlation-Pooling This is the original implementation of the pooling method introduced in "Speaker embeddings by modeling channel

Themos Stafylakis 10 Apr 30, 2022
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels without pre-existing datasets. The only available option is often to collect and annotate texts by hand, which is expensive and time-consuming.

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

vanint 101 Dec 30, 2022
Labelling platform for text using distant supervision

With DataQA, you can label unstructured text documents using rule-based distant supervision.

null 245 Aug 5, 2022
A Word Level Transformer layer based on PyTorch and 🤗 Transformers.

Transformer Embedder A Word Level Transformer layer based on PyTorch and 🤗 Transformers. How to use Install the library from PyPI: pip install transf

Riccardo Orlando 27 Nov 20, 2022
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 7, 2022
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit.

PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit. It provides easy-to-use, low-overhead, first-class Python wrappers for t

null 922 Dec 31, 2022
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large. GitHub Repository · Documentat

Xing Han Lu 244 Dec 30, 2022