Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Overview

This repo contains the implementation of our paper:

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Paper Link

Replication

Python environment

pip install -e .  # run from the DSLP directory
pip install tensorflow tensorboard sacremoses nltk Ninja omegaconf
pip install 'fuzzywuzzy[speedup]'
pip install hydra-core==1.0.6
pip install sacrebleu==1.5.1
pip install git+https://github.com/dugu9sword/lunanlp.git
git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode && pip install .
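
To sanity-check the environment (a minimal check we add here, not part of the original instructions), the core imports should succeed:

python3 -c "import fairseq; print(fairseq.__version__)"   # installed by pip install -e .
python3 -c "import torch; print(torch.cuda.is_available())"
python3 -c "import ctcdecode"                             # only needed for CTC-based models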

Dataset

We downloaded the distilled data from FairSeq.

Preprocess it with:

TEXT=wmt14_ende_distill
python3 fairseq_cli/preprocess.py --source-lang en --target-lang de \
   --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
   --destdir data-bin/wmt14.en-de_kd --workers 40 --joined-dictionary
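
If preprocessing succeeds, data-bin/wmt14.en-de_kd should contain a joined dictionary and binarized splits roughly as follows (file names follow fairseq's conventions; this listing is illustrative):

ls data-bin/wmt14.en-de_kd
# dict.en.txt  dict.de.txt   <- joined dictionary (identical files)
# train.en-de.en.bin  train.en-de.en.idx  train.en-de.de.bin  train.en-de.de.idx
# valid.en-de.*  test.en-de.*  preprocess.log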

Or you can download all the binarized files here.

Hyperparameters

Hyperparameter                 EN<->RO    EN<->DE
--validate-interval-updates    300        500
number of tokens per batch     32K        128K
--dropout                      0.3        0.1
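
The "number of tokens per batch" above is the effective batch size: --max-tokens (per GPU) × number of GPUs × --update-freq. The training commands below use --max-tokens 8192; a sketch of the arithmetic (the GPU counts and --update-freq values here are our assumptions, not taken from the paper):

# EN<->DE: 8192 tokens/GPU x 8 GPUs x --update-freq 2 = 131,072 ~ 128K tokens per batch
# EN<->RO: 8192 tokens/GPU x 4 GPUs x --update-freq 1 =  32,768 ~  32K tokens per batch
# With fewer GPUs, raise --update-freq proportionally to keep the effective batch size.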

Note:

  1. We found that label smoothing is not useful for CTC-based models (at least not with our implementation); we suggest keeping --label-smoothing at 0 for them.
  2. The dropout rate plays a significant role for GLAT, CMLM, and the vanilla NAT. On WMT'14 EN->DE, for example, the vanilla NAT reaches 21.18 BLEU with dropout 0.1, but only 19.68 BLEU with dropout 0.3.

Training

We provide the scripts for replicating the results on the WMT'14 EN->DE task. For other tasks, adapt the binarized data path, --source-lang, --target-lang, and the hyperparameters above accordingly.
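
For example, an EN->RO run would change only these flags relative to the EN->DE commands below (a sketch; the data-bin path is a placeholder, the hyperparameters follow the table above, and "..." stands for the remaining flags, which are identical to the corresponding EN->DE command):

python3 train.py data-bin/wmt16.en-ro_kd --source-lang en --target-lang ro \
   --validate-interval-updates 300 --dropout 0.3 \
   ...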

GLAT with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_glat --criterion glat_loss --arch glat_sd --noise full_mask \
   --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 --glat-mode glat \
   --length-loss-factor 0.1 --pred-length-offset 

CMLM with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch glat_sd --noise full_mask \
   --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 \
   --length-loss-factor 0.1 --pred-length-offset 

Vanilla NAT with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
   --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 \
   --length-loss-factor 0.1 --pred-length-offset 

Vanilla NAT with DSLP and Mixed Training

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
   --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192  --ss-ratio 0.3 --fixed-ss-ratio --masked-loss \
   --length-loss-factor 0.1 --pred-length-offset 

CTC with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.0 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 

CTC with DSLP and Mixed Training

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.0 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 --ss-ratio 0.3 --fixed-ss-ratio

Evaluation

Average the 5 best checkpoints with scripts/average_checkpoints.py. Our results are based on either the best single checkpoint or the averaged checkpoint, whichever scores higher in validation-set BLEU.
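
For example, with fairseq's scripts/average_checkpoints.py (the checkpoint glob assumes fairseq's checkpoint.best_bleu_*.pt naming under --keep-best-checkpoints; adjust the paths to your run):

python3 scripts/average_checkpoints.py \
    --inputs checkpoints/checkpoint.best_bleu_*.pt \
    --output checkpoints/checkpoint.avg_best.pt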

fairseq-generate data-bin/wmt14.en-de_kd  --path PATH_TO_A_CHECKPOINT \
    --gen-subset test --task translation_lev --iter-decode-max-iter 0 \
    --iter-decode-eos-penalty 0 --beam 1 --remove-bpe --print-step --batch-size 100

Note: 1) Add --plain-ctc --model-overrides '{"ctc_beam_size": 1, "plain_ctc": True}' if the model is CTC-based; 2) change the task to translation_glat if the model is GLAT-based.
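
For instance, evaluating a CTC-based checkpoint combines the command above with the overrides from the note:

fairseq-generate data-bin/wmt14.en-de_kd  --path PATH_TO_A_CHECKPOINT \
    --gen-subset test --task translation_lev --iter-decode-max-iter 0 \
    --iter-decode-eos-penalty 0 --beam 1 --remove-bpe --print-step --batch-size 100 \
    --plain-ctc --model-overrides '{"ctc_beam_size": 1, "plain_ctc": True}'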

Output

In addition, we provide the outputs of CTC w/ DSLP, CTC w/ DSLP & Mixed Training, Vanilla NAT w/ DSLP, Vanilla NAT w/ DSLP & Mixed Training, GLAT w/ DSLP, and CMLM w/ DSLP for review purposes.

Model Reference Hypothesis
CTC w/ DSLP ref hyp
CTC w/ DSLP & Mixed Training ref hyp
Vanilla NAT w/ DSLP ref hyp
Vanilla NAT w/ DSLP & Mixed Training ref hyp
GLAT w/ DSLP ref hyp
CMLM w/ DSLP ref hyp

Note: The output is on WMT'14 EN-DE. The references are paired with hypotheses for each model.

Training Efficiency

We show the training efficiency of our DSLP model, built on the vanilla NAT model. Specifically, we compare the BLEU scores of the vanilla NAT and the vanilla NAT with DSLP & Mixed Training at the same training time (in hours).

As shown, our DSLP model achieves much higher BLEU scores shortly after training starts (~3 hours). This demonstrates that DSLP is much more efficient to train, as our model achieves higher BLEU scores with the same amount of training cost.

[Figure: BLEU vs. training time (hours) for vanilla NAT and vanilla NAT w/ DSLP & Mixed Training]

We ran the experiments with 8 Tesla V100 GPUs. The batch size is 128K tokens, and each model is trained for 300K updates.

Comments
  • Reproducing Vanilla NAT Baseline

    Hi all,

    thanks for sharing your code!

    I would like to be able to reproduce the Vanilla NAT Baseline (21.18 BLEU on WMT'14 EN-DE). What is the corresponding command?

    Btw, the command for Vanilla NAT with DSLP

    python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de --save-dir checkpoints --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5 --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000 --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1 --concat-yhat --concat-dropout 0.0 --label-smoothing 0.1 \
       --activation-fn gelu --dropout 0.1 --max-tokens 8192

    does not work. I get: train.py: error: unrecognized arguments: --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1

    Cheers, Stephan

    opened by stephanpeitz 8
  • Does your implementation for CTC + GLAT work?

    I've looked through the code for nat_ctc_glat.py and was wondering how the alignment works for the glancing sampling. For the normal CTC without GLAT this is handled by F.ctc_loss but it seems it's not so straightforward for the GLAT part. I tried to code it up following some of the implementation here as well as in the GLAT repository.

    For me, it fails at the check pred_tokens == tgt_tokens in the GLAT part, which makes sense, as pred_tokens will have the length of the upsampled source from CTC while tgt_tokens are most likely shorter.

    Not sure if it fails using your exact code as well but it would make sense to me, what did you change to make this work?

    opened by SirRob1997 7
  • Generate test with beam=1: BLEU4 = 15.70, 50.9/22.2/11.0/5.7 (BP=0.961, ratio=0.962, syslen=62024, reflen=64481)

    First of all, thanks for your code. I tried to reproduce the result of CTC with DSLP and Mixed Training, but I got the following BLEU:

    Generate test with beam=1: BLEU4 = 15.70, 50.9/22.2/11.0/5.7 (BP=0.961, ratio=0.962, syslen=62024, reflen=64481)
    

    My scripts are the following:

    TEXT=wmt14_ende_distill
    python3 fairseq_cli/preprocess.py --source-lang en --target-lang de \
       --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
       --destdir data-bin/wmt14.en-de_kd --workers 40 --joined-dictionary
    
    python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir ori_checkpoints  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0 \
       --activation-fn gelu --dropout 0.1  --max-tokens 4000 --batch-size 32 --ss-ratio 0.3 --fixed-ss-ratio
    
    fairseq-generate data-bin/wmt14.en-de_kd  --path ori_checkpoints/checkpoint_best.pt \
        --gen-subset test --task translation_lev --iter-decode-max-iter 0 \
        --iter-decode-eos-penalty 0 --beam 1 --remove-bpe --print-step --batch-size 50 \
        --plain-ctc --model-overrides '{"ctc_beam_size": 1, "plain_ctc": True}'
    

    Because I used an RTX 3090 GPU, I had to change the batch size and max tokens parameters.

    Please tell me how to reproduce your results. Many thanks!

    opened by Rexbalaeniceps 5
  • Training time cost per epoch in GLAT with DSLP

    Hi all,

    Thanks very much for your awesome code!

    I noticed there are some differences between your GLAT implementation and the repo here. I tried both and found that the training time per epoch increased rapidly during training (epoch 1 took 10 min, but epoch 50 took 120 min). I wonder if you have encountered this in your experiments and what causes it.

    Thanks very much! hemingkx

    opened by hemingkx 3
  • How to install dependencies and run?

    I first ran pip install --editable . and then ran your training script. The error was ModuleNotFoundError: No module named 'tensorflow'. I found that the tensorflow import in file_io was a hack, so I removed all the related lines. However, it still produces the following error:

      File "/tmp/DSLP/fairseq/criterions/__init__.py", line 18, in <module>
        (
    TypeError: cannot unpack non-iterable NoneType object
    
    opened by zkx06111 3
  • About `--arch glat_sd`

    Hi! I ran into a new problem when I attempted to train a GLAT with DSLP model.

    Following your scripts:

    CUDA_VISIBLE_DEVICES=7 python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir glat_dslp_checkpoints  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_glat --criterion glat_loss --arch glat_sd --noise full_mask \
       --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
       --activation-fn gelu --dropout 0.1  --max-tokens 8192 --glat-mode glat \
       --length-loss-factor 0.1 --pred-length-offset 
    

    I got this error:

    train.py: error: argument --arch/-a: invalid choice: 'glat_sd' (choose from 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de', 'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big', 'transformer_wmt_en_de_big_t2t', 'multilingual_transformer', 'multilingual_transformer_iwslt_de_en', 'transformer_lm', 'transformer_lm_big', 'transformer_lm_baevski_wiki103', 'transformer_lm_wiki103', 'transformer_lm_baevski_gbw', 'transformer_lm_gbw', 'transformer_lm_gpt', 'transformer_lm_gpt2_small', 'transformer_lm_gpt2_medium', 'transformer_lm_gpt2_big', 'lightconv', 'lightconv_iwslt_de_en', 'lightconv_wmt_en_de', 'lightconv_wmt_en_de_big', 'lightconv_wmt_en_fr_big', 'lightconv_wmt_zh_en_big', 'lightconv_lm', 'lightconv_lm_gbw', 'nat', 'nonautoregressive_transformer_wmt_en_de', 'nat_12d', 'nat_24d', 'nacrf_transformer', 'iterative_nonautoregressive_transformer', 'iterative_nonautoregressive_transformer_wmt_en_de', 'cmlm_transformer', 'cmlm_transformer_wmt_en_de', 'levenshtein_transformer', 'levenshtein_transformer_wmt_en_de', 'levenshtein_transformer_vaswani_wmt_en_de_big', 'levenshtein_transformer_wmt_en_de_big', 'insertion_transformer', 'nat_glat', 'glat_base', 'glat_big', 'glat_16e6d', 'nat_sd_shared', 'nat_sd', 'nat_ctc_sd', 'nat_ctc_cross_layer_hidden_replace_deep_sup', 'nat_ctc_sd_12d', 'nat_ctc_sd_de_24d', 'nat_ctc_s', 'nat_ctc_d', 'nat_sd_glat_base', 'nat_sd_glat', 'nat_sd_glat_12d', 'nat_sd_glat_24d', 'nat_sd_glat_12e', 'glat_s', 'glat_d', 'nat_s', 'nat_s_12d', 'nat_s_24d', 'nat_d', 'nat_d_12d', 'nat_d_24d', 'nat_sd_glat_anneal', 'nat_sd_glat_anneal_12d', 'nat_sd_glat_anneal_24d', 'nat_sd_glat_anneal_12e', 'nat_ctc', 'nat_ctc_fixlen', 'nat_ctc_refine', 'ctc_from_zaixiang', 'cmlm_sd', 'nat_cf', 'nat_md', 'nat_sd_ss', 'nat_sd_glat_ss', 'nat_ctc_sd_ss', 'cmlm_sd_ss', 'transformer_align', 'transformer_wmt_en_de_big_align', 'lstm', 'lstm_wiseman_iwslt_de_en', 'lstm_luong_wmt_en_de', 'lstm_lm', 's2t_berard', 's2t_berard_256_3_3', 's2t_berard_512_3_2', 's2t_berard_512_5_3', 's2t_transformer', 's2t_transformer_s', 's2t_transformer_sp', 's2t_transformer_m', 's2t_transformer_mp', 's2t_transformer_l', 's2t_transformer_lp', 'fconv', 'fconv_iwslt_de_en', 'fconv_wmt_en_ro', 'fconv_wmt_en_de', 'fconv_wmt_en_fr', 'roberta', 'roberta_base', 'roberta_large', 'xlm', 'masked_lm', 'bert_base', 'bert_large', 'xlm_base', 'wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'fconv_self_att', 'fconv_self_att_wp', 'fconv_lm', 'fconv_lm_dauphin_wikitext103', 'fconv_lm_dauphin_gbw', 'transformer_from_pretrained_xlm', 'hf_gpt2', 'hf_gpt2_medium', 'hf_gpt2_large', 'hf_gpt2_xl', 'bart_large', 'bart_base', 'mbart_large', 'mbart_base', 'mbart_base_wmt20', 'dummy_model', 'transformer_lm_megatron', 'transformer_lm_megatron_11b', 'transformer_iwslt_de_en_pipeline_parallel', 'transformer_wmt_en_de_big_pipeline_parallel', 'model_parallel_roberta', 'model_parallel_roberta_base', 'model_parallel_roberta_large')
    

    I found that glat_sd doesn't exist in the options. Why is this? By the way, thank you for the previous response; I have achieved ~27 BLEU.

    opened by Rexbalaeniceps 2
  • OOM problem with the model nat_ctc_sd_ss

    I trained the "nat_ctc_sd_ss" model with the command from the README.md on a Tesla V100 GPU, but I got an out-of-memory error. Is there anything that should be changed? My training command:

    python3 train.py $DATA --source-lang en --target-lang de  --save-dir checkpoints/NAT_CTC_DSLP_MT  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.0 \
       --activation-fn gelu --dropout 0.1  --max-tokens 8192 --ss-ratio 0.3 --fixed-ss-ratio
    

    My training log is as follows:

    2022-09-07 20:31:12 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'simple', 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': True, 'suppress_crashes': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:15846', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'no_c10d', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'distributed_wrapper': 'DDP', 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'tpu': False, 'distributed_num_procs': 8}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 8192, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 500, 'validate_after_updates': 0, 'fixed_validation_seed': 7, 'disable_validation': False, 'max_tokens_valid': 8192, 'batch_size_valid': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 300000, 'stop_time_hours': 0.0, 'clip_norm': 2.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': 1e-09, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints/NAT_CTC_DSLP_MT', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 500, 'keep_interval_updates': 5, 'keep_last_epochs': 5, 'keep_best_checkpoints': 5, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'bleu', 'maximize_best_checkpoint_metric': True, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'model_parallel_size': 1, 'distributed_rank': 0}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 
'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False, 'force_no_target': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='nat_ctc_sd_ss', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, 
localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'task': Namespace(_name='translation_lev', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, 
decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, 
threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'criterion': Namespace(_name='nat_loss', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, 
num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 10000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | [en] dictionary: 39842 types
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | [de] dictionary: 39842 types
    2022-09-07 20:31:12 | INFO | fairseq.data.data_utils | loaded 3,000 examples from: ../data/wmt14_ende_distill/bin/valid.en-de.en
    2022-09-07 20:31:12 | INFO | fairseq.data.data_utils | loaded 3,000 examples from: ../data/wmt14_ende_distill/bin/valid.en-de.de
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | ../data/wmt14_ende_distill/bin valid en-de 3000 examples
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | NATransformerModel(
      (encoder): FairseqNATEncoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(39842, 512, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(1026, 512, padding_idx=1)
        (layers): ModuleList(
          (0): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (1): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (2): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (3): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (4): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (5): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
        )
      )
      (decoder): NATransformerDecoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(39842, 512, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(1026, 512, padding_idx=1)
        (layers): ModuleList(
          (0): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (1): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (2): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (3): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (4): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          (5): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
        )
        (output_projection): Linear(in_features=512, out_features=39842, bias=False)
        (embed_length): Embedding(256, 512)
        (reduce_concat): ModuleList(
          (0): Linear(in_features=1024, out_features=512, bias=False)
          (1): Linear(in_features=1024, out_features=512, bias=False)
          (2): Linear(in_features=1024, out_features=512, bias=False)
          (3): Linear(in_features=1024, out_features=512, bias=False)
          (4): Linear(in_features=1024, out_features=512, bias=False)
        )
      )
    )
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | task: TranslationLevenshteinTask
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | model: NATransformerModel
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | criterion: LabelSmoothedDualImitationCriterion
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | num. model params: 68,340,736 (num. trained: 68,340,736)
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.0.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.1.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.2.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.3.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.4.bias
    2022-09-07 20:31:14 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
    2022-09-07 20:31:14 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
    2022-09-07 20:31:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 8 workers***********************
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   0: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   1: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   2: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   3: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   4: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   5: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   6: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   7: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 8 workers***********************
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | max tokens per GPU = 8192 and batch size per GPU = None
    2022-09-07 20:31:14 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/NAT_CTC_DSLP_MT/checkpoint_last.pt
    2022-09-07 20:31:14 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/NAT_CTC_DSLP_MT/checkpoint_last.pt
    2022-09-07 20:31:14 | INFO | fairseq.trainer | loading train data for epoch 1
    2022-09-07 20:31:15 | INFO | fairseq.data.data_utils | loaded 3,961,179 examples from: ../data/wmt14_ende_distill/bin/train.en-de.en
    2022-09-07 20:31:15 | INFO | fairseq.data.data_utils | loaded 3,961,179 examples from: ../data/wmt14_ende_distill/bin/train.en-de.de
    2022-09-07 20:31:15 | INFO | fairseq.tasks.translation | ../data/wmt14_ende_distill/bin train en-de 3961179 examples
    2022-09-07 20:31:17 | INFO | fairseq.tasks.translation_lev | Dataset original size: 3961179, filtered size: 3961117
    2022-09-07 20:31:18 | INFO | fairseq.trainer | begin training epoch 1
    2022-09-07 20:31:23 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 7; 31.75 GiB total capacity; 28.04 GiB already allocated; 1.11 GiB free; 28.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    2022-09-07 20:31:23 | WARNING | fairseq.trainer | |===========================================================================|
    |                  PyTorch CUDA memory summary, device ID 0                 |
    |---------------------------------------------------------------------------|
    |            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
    |===========================================================================|
    |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
    |---------------------------------------------------------------------------|
    | Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Allocations           |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Active allocs         |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | GPU reserved segments |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Non-releasable allocs |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize allocations  |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize GPU segments |       0    |       0    |       0    |       0    |
    |===========================================================================|
    
    opened by YudiZh 0
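    If you hit this OOM on similar hardware, a common mitigation (not specific to this repo) is to shrink the per-GPU token budget and recover the effective batch size with gradient accumulation via fairseq's stock --update-freq flag. A minimal sketch; the values are illustrative, and the remaining flags should be carried over unchanged from the training script above:

        # Halving --max-tokens while doubling --update-freq keeps the effective
        # tokens-per-update constant but lowers peak activation memory per GPU.
        python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de \
           --max-tokens 4096 --update-freq 2 \
           # ...plus the remaining flags from the corresponding training script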
  • No glat_sd arch

    No glat_sd arch

    Hi Chenyang, thanks for your great code! I'm trying to reproduce the GLAT+DSLP model. I checked the provided training scripts, but there is no "--arch glat_sd" model registered in the code; should it be "nat_sd_glat"? Also, what do "ss" and "sd" stand for? Does "sd" mean deeply supervised? What about "ss"? Thanks for your answer!

    opened by bbo0924 2
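    Assuming the repo builds on stock fairseq's architecture registry, the registered --arch names can be listed directly; this one-liner is a sketch, not a utility shipped with the repo. Run it from the DSLP root so the repo's model registrations are importable:

        # Lists every registered architecture name; filter for the NAT/DSLP variants.
        python3 -c "from fairseq.models import ARCH_MODEL_REGISTRY; print('\n'.join(sorted(ARCH_MODEL_REGISTRY)))" | grep -i sd

    If glat_sd does not appear in the output, the name the registry does show (e.g., nat_sd_glat) is the one to pass to --arch.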
  • The shape of probs_seq does not match with the shape of the vocabulary Segmentation fault (core dumped)

    The shape of probs_seq does not match with the shape of the vocabulary Segmentation fault (core dumped)

    [/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
    [/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
    [/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary
    Segmentation fault (core dumped)

    I have encountered this problem without modifying the original code. May I ask what the cause might be?

    opened by thunder123321 6
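    The failed check compares each step's probability vector against the size of the labels list given to the decoder, so the usual cause is a mismatch between the model's output vocabulary and the vocabulary handed to ctcdecode. A diagnostic sketch; the dictionary path is an assumption based on the preprocessing layout above, not taken from this issue:

        # A fairseq Dictionary adds 4 special symbols (<s>, <pad>, </s>, <unk>) on top of
        # the entries in the dict file; the printed size must equal both the last
        # dimension of probs_seq and the length of the labels list passed to ctcdecode.
        python3 -c "from fairseq.data import Dictionary; print(len(Dictionary.load('data-bin/wmt14.en-de_kd/dict.de.txt')))"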
  • ctcdecode install error

    ctcdecode install error

    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1673, in _run_ninja_build
        env=env)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/subprocess.py", line 512, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "setup.py", line 55, in <module>
        cmdclass={'build_ext': BuildExtension}
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 708, in build_extensions
        build_ext.build_extensions(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
        _build_ext.build_ext.build_extensions(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
        _build_ext.build_extension(self, ext)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
        depends=ext.depends)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 538, in unix_wrap_ninja_compile
        with_cuda=with_cuda)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1359, in _write_ninja_file_and_compile_objects
        error_prefix='Error compiling objects for extension')
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension

    Environment: torch 1.8, CUDA 11.1, gcc 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04), g++ 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04). Is my torch version too high?

    opened by thunder123321 4
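    The CalledProcessError at the bottom only reports that ninja failed; the actual compiler error appears earlier in the build output. A sketch for isolating it (nothing here is specific to this repo):

        # Rebuild with verbose output and capture the log.
        cd ctcdecode
        pip install . -v 2>&1 | tee build.log
        # The first "error:" line in build.log is the real failure, not the
        # RuntimeError raised at the end by torch's cpp_extension.
        grep -n "error:" build.log | head

    gcc/g++ 7.5 is generally new enough for torch 1.8 extensions, so a mismatch between the CUDA toolkit nvcc uses and the one torch was built against is often the more likely culprit than the torch version itself.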
Owner
Chenyang Huang
Stay hungry, stay foolish
A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

Keon Lee 237 Jan 2, 2023
Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS)

Bidirectional Variational Inference for Non-Autoregressive Text-to-Speech (BVAE-TTS) Yoonhyung Lee, Joongbo Shin, Kyomin Jung Abstract: Although early

LEE YOON HYUNG 147 Dec 5, 2022
The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

VAENAR-TTS This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis". Sa

THUHCSI 138 Oct 28, 2022
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Keon Lee 67 Nov 14, 2022
PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

null 760 Jan 3, 2023
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository containing my implementa

Utkarsh Jain 1 Feb 17, 2022
Product-Review-Summarizer - Created a product review summarizer which clustered thousands of product reviews and summarized them into a maximum of 500 characters, saving precious time of customers and helping them make a wise buying decision.

Parv Bhatt 1 Jan 1, 2022
A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. X-Ray supports 18 languages.

WordDumb A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. Languages X-Ray supp

null 172 Dec 29, 2022
Implementation of Token Shift GPT - An autoregressive model that solely relies on shifting the sequence space for mixing

Token Shift GPT Implementation of Token Shift GPT - An autoregressive model that relies solely on shifting along the sequence dimension and feedforwar

Phil Wang 32 Oct 14, 2022
Original implementation of the pooling method introduced in "Speaker embeddings by modeling channel-wise correlations"

Speaker-Embeddings-Correlation-Pooling This is the original implementation of the pooling method introduced in "Speaker embeddings by modeling channel

Themos Stafylakis 10 Apr 30, 2022
skweak: A software toolkit for weak supervision applied to NLP tasks

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels without pre-existing datasets. The only available option is often to collect and annotate texts by hand, which is expensive and time-consuming.

Norsk Regnesentral (Norwegian Computing Center) 850 Dec 28, 2022
Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

vanint 101 Dec 30, 2022
Labelling platform for text using distant supervision

With DataQA, you can label unstructured text documents using rule-based distant supervision.

null 245 Aug 5, 2022
A Word Level Transformer layer based on PyTorch and 🤗 Transformers.

Transformer Embedder A Word Level Transformer layer based on PyTorch and 🤗 Transformers. How to use Install the library from PyPI: pip install transf

Riccardo Orlando 27 Nov 20, 2022
Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

FCS-applications Source code for CsiNet and CRNet using the Fully Connected Layer-Shared feedback architecture. Introduction This repository contains

Boyuan Zhang 4 Oct 7, 2022
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit.

PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit. It provides easy-to-use, low-overhead, first-class Python wrappers for t

null 922 Dec 31, 2022
Some embedding layer implementation using ivy library

ivy-manual-embeddings Some embedding layer implementation using ivy library. Just for fun. It is based on NYCTaxiFare dataset from kaggle (cut down to

Ishtiaq Hussain 2 Feb 10, 2022
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large. GitHub Repository · Documentat

Xing Han Lu 244 Dec 30, 2022