Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Overview

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision
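
DSLP adds two ingredients to a non-autoregressive Transformer decoder: every decoder layer makes its own word predictions and receives its own training loss (deep supervision), and each layer's prediction is embedded, concatenated with the hidden states, and projected back down before being fed to the next layer (layer-wise prediction awareness). The following is a minimal PyTorch sketch of that idea for intuition only; the class and tensor names are ours, it uses encoder layers as stand-ins for the real NAT decoder layers, and it omits length prediction, CTC, and cross-attention:

    import torch
    import torch.nn as nn

    class DSLPDecoderSketch(nn.Module):
        """Illustrative only: layer-wise prediction + deep supervision."""

        def __init__(self, num_layers, d_model, vocab_size):
            super().__init__()
            # Stand-ins for the actual NAT decoder layers (which also cross-attend).
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
                for _ in range(num_layers)
            )
            self.embed = nn.Embedding(vocab_size, d_model)
            self.out_proj = nn.Linear(d_model, vocab_size, bias=False)
            # Fuse [hidden; embedded prediction] back to d_model between layers,
            # mirroring the reduce_concat modules visible in the model printout below.
            self.reduce_concat = nn.ModuleList(
                nn.Linear(2 * d_model, d_model, bias=False)
                for _ in range(num_layers - 1)
            )

        def forward(self, h):
            layer_logits = []  # deep supervision: a loss is attached to every entry
            for i, layer in enumerate(self.layers):
                h = layer(h)
                logits = self.out_proj(h)
                layer_logits.append(logits)
                if i < len(self.layers) - 1:
                    y_hat = logits.argmax(-1)                      # layer-wise prediction
                    fused = torch.cat([h, self.embed(y_hat)], -1)  # prediction-aware input
                    h = self.reduce_concat[i](fused)
            return layer_logits

    # Toy usage: 6 layers, 512-dim, as in the commands below.
    print(len(DSLPDecoderSketch(6, 512, 39842)(torch.randn(2, 10, 512))))  # 6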

Training Efficiency

We show the training efficiency of our DSLP model based on the vanilla NAT model. Specifically, we compare the BLEU scores of vanilla NAT and vanilla NAT with DSLP & Mixed Training given the same training time (in hours).

As observed, our DSLP model achieves much higher BLEU scores shortly after training starts (~3 hours). This shows that DSLP is much more efficient to train, as it reaches higher BLEU scores at the same training cost.

Efficiency

We run the experiments on 8 Tesla V100 GPUs. The batch size is 128K tokens, and each model is trained for 300K updates.
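
A quick sanity check on the numbers: with the --max-tokens 8192 flag used in the commands below, 8 GPUs provide 64K tokens per step, so reaching a 128K-token batch implies accumulating gradients over two steps (fairseq's --update-freq 2). The commands below do not set --update-freq, so treat this as our reading of the setup rather than a documented flag:

    # Back-of-envelope check; update_freq=2 is our assumption, not a listed flag.
    gpus, max_tokens_per_gpu, update_freq = 8, 8192, 2
    print(gpus * max_tokens_per_gpu * update_freq)  # 131072 tokens, i.e. ~128K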

Replication

We provide the scripts for replicating our results on the WMT'14 EN-DE task.

Dataset

We download the distilled data from FairSeq.

Preprocess the data with:

TEXT=wmt14_ende_distill
python3 fairseq_cli/preprocess.py --source-lang en --target-lang de \
   --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
   --destdir data-bin/wmt14.en-de_kd --workers 40 --joined-dictionary

Training

GLAT with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_glat --criterion glat_loss --arch glat_sd --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 --glat-mode glat 
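
The --glat-mode glat flag enables glancing training: some ground-truth tokens are revealed to the decoder, and the number revealed is proportional to how far the model's first-pass prediction is from the target. Below is a rough sketch of that sampling step, under the standard (non-CTC) setting where prediction and target lengths match; all names here are illustrative, not the repository's API:

    import torch

    def glancing_inputs(pred_tokens, tgt_tokens, mask_id, ratio=0.5):
        """Reveal a prediction-distance-proportional number of gold tokens (sketch)."""
        batch, length = tgt_tokens.shape
        hamming = (pred_tokens != tgt_tokens).sum(-1)     # per-sentence distance
        n_reveal = (hamming.float() * ratio).long()       # how many positions to unmask
        inputs = torch.full_like(tgt_tokens, mask_id)
        for b in range(batch):
            idx = torch.randperm(length)[: n_reveal[b]]
            inputs[b, idx] = tgt_tokens[b, idx]           # glance at the gold tokens here
        return inputs

Note that this 1:1 alignment breaks down once CTC upsampling is involved, which is exactly the difficulty raised in the CTC + GLAT issue below.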

CMLM with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch glat_sd --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 

Vanilla NAT with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 

Vanilla NAT with DSLP and Mixed Training

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192  --ss-ratio 0.3 --fixed-ss-ratio --masked-loss
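
The --ss-ratio 0.3 --fixed-ss-ratio --masked-loss flags enable mixed training: during training, a fixed fraction of the intermediate layer-wise predictions is replaced with ground-truth tokens before being fed to the next layer. A minimal sketch under our reading of those flags (names are illustrative):

    import torch

    def mix_predictions(y_hat, tgt_tokens, ss_ratio=0.3):
        """With probability ss_ratio per position, feed the gold token to the
        next layer instead of the current layer's own prediction (training only)."""
        use_gold = torch.rand(y_hat.shape) < ss_ratio
        return torch.where(use_gold, tgt_tokens, y_hat)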

CTC with DSLP

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 

CTC with DSLP and Mixed Training

python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir checkpoints  --eval-tokenized-bleu \
   --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
   --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
   --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
   --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
   --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
   --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
   --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0 \
   --activation-fn gelu --dropout 0.1  --max-tokens 8192 --ss-ratio 0.3 --fixed-ss-ratio

Evaluation

fairseq-generate data-bin/wmt14.en-de_kd  --path PATH_TO_A_CHECKPOINT \
    --gen-subset test --task translation_lev --iter-decode-max-iter 0 \
    --iter-decode-eos-penalty 0 --beam 1 --remove-bpe --print-step --batch-size 100

Note:
1) Add --plain-ctc --model-overrides '{"ctc_beam_size": 1, "plain_ctc": True}' if the model is CTC-based.
2) Change the task to translation_glat if the model is GLAT-based.
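
With --iter-decode-max-iter 0, generation is a single fully non-autoregressive pass. For CTC-based models, --plain-ctc corresponds to greedy CTC decoding: take the argmax token at every upsampled position, merge consecutive repeats, then drop blanks. A generic sketch of that post-processing (the blank index here is illustrative, not taken from this repository):

    import torch

    def plain_ctc_collapse(tokens, blank_id):
        """Greedy CTC post-processing: merge consecutive repeats, then remove blanks."""
        out, prev = [], None
        for t in tokens.tolist():
            if t != prev and t != blank_id:   # keep a new, non-blank token
                out.append(t)
            prev = t
        return out

    print(plain_ctc_collapse(torch.tensor([5, 5, 0, 7, 7, 7, 0, 5]), blank_id=0))  # [5, 7, 5]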

Output

In addition, we provide the outputs of CTC w/ DSLP, CTC w/ DSLP & Mixed Training, Vanilla NAT w/ DSLP, Vanilla NAT w/ DSLP & Mixed Training, GLAT w/ DSLP, and CMLM w/ DSLP for review purposes.

Model                                  Reference  Hypothesis
CTC w/ DSLP                            ref        hyp
CTC w/ DSLP & Mixed Training           ref        hyp
Vanilla NAT w/ DSLP                    ref        hyp
Vanilla NAT w/ DSLP & Mixed Training   ref        hyp
GLAT w/ DSLP                           ref        hyp
CMLM w/ DSLP                           ref        hyp

Note: The output is on WMT'14 EN-DE. The references are paired with hypotheses for each model.

Comments
  • Reproducing Vanilla NAT Baseline


    Hi all,

    thanks for sharing your code!

    I would like to be able to reproduce the Vanilla NAT Baseline (21.18 BLEU on WMT'14 EN-DE). What is the corresponding command?

    Btw, the command for Vanilla NAT with DSLP

    python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de --save-dir checkpoints --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5 --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000 --task translation_lev --criterion nat_loss --arch nat_sd --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1 --concat-yhat --concat-dropout 0.0 --label-smoothing 0.1 \
       --activation-fn gelu --dropout 0.1 --max-tokens 8192

    does not work. I get: train.py: error: unrecognized arguments: --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1

    Cheers, Stephan

    opened by stephanpeitz 8
  • Does your implementation for CTC + GLAT work?


    I've looked through the code for nat_ctc_glat.py and was wondering how the alignment works for the glancing sampling. For normal CTC without GLAT, this is handled by F.ctc_loss, but it seems it's not so straightforward for the GLAT part. I tried to code it up following some of the implementation here as well as in the GLAT repository.

    For me, it fails at the check pred_tokens == tgt_tokens in the GLAT part, which makes sense, as pred_tokens will have the length of the upsampled source from CTC while tgt_tokens are most likely shorter.

    Not sure if it fails with your exact code as well, but it would make sense to me. What did you change to make this work?

    opened by SirRob1997 7
  • Generate test with beam=1: BLEU4 = 15.70, 50.9/22.2/11.0/5.7 (BP=0.961, ratio=0.962, syslen=62024, reflen=64481)


    Thanks for your code. I tried to reproduce the result of CTC with DSLP and Mixed Training, but I got the following BLEU:

    Generate test with beam=1: BLEU4 = 15.70, 50.9/22.2/11.0/5.7 (BP=0.961, ratio=0.962, syslen=62024, reflen=64481)
    

    My scripts are as follows:

    TEXT=wmt14_ende_distill
    python3 fairseq_cli/preprocess.py --source-lang en --target-lang de \
       --trainpref $TEXT/train.en-de --validpref $TEXT/valid.en-de --testpref $TEXT/test.en-de \
       --destdir data-bin/wmt14.en-de_kd --workers 40 --joined-dictionary
    
    python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir ori_checkpoints  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0 \
       --activation-fn gelu --dropout 0.1  --max-tokens 4000 --batch-size 32 --ss-ratio 0.3 --fixed-ss-ratio
    
    fairseq-generate data-bin/wmt14.en-de_kd  --path ori_checkpoints/checkpoint_best.pt \
        --gen-subset test --task translation_lev --iter-decode-max-iter 0 \
        --iter-decode-eos-penalty 0 --beam 1 --remove-bpe --print-step --batch-size 50 \
        --plain-ctc --model-overrides '{"ctc_beam_size": 1, "plain_ctc": True}'
    

    Because I used an RTX 3090 GPU, I had to change the batch-size and max-tokens parameters.

    Please tell me how to reproduce your results. Many thanks!

    opened by Rexbalaeniceps 5
  • Training time cost per epoch in GLAT with DSLP


    Hi all,

    Thanks very much for your awesome code!

    I noticed there are some differences between your GLAT implementation and the repo here. I tried both and found that the training time per epoch increased rapidly during training (epoch 1 took 10 min, but epoch 50 took 120 min). I wonder whether you have encountered this in your experiments and what causes it.

    Thanks very much! hemingkx

    opened by hemingkx 3
  • How to install dependencies and run?


    I first ran pip install --editable . and then ran your training script. The error was ModuleNotFoundError: No module named 'tensorflow'. I found that the tensorflow import in file_io was a hack, so I removed all the related lines. However, it still produces the following error:

      File "/tmp/DSLP/fairseq/criterions/__init__.py", line 18, in <module>
        (
    TypeError: cannot unpack non-iterable NoneType object
    
    opened by zkx06111 3
  • About `--arch glat_sd`


    Hi! I ran into a new problem when I attempted to train a GLAT with DSLP model.

    Following your scripts:

    CUDA_VISIBLE_DEVICES=7 python3 train.py data-bin/wmt14.en-de_kd --source-lang en --target-lang de  --save-dir glat_dslp_checkpoints  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_glat --criterion glat_loss --arch glat_sd --noise full_mask \
       --concat-yhat --concat-dropout 0.0  --label-smoothing 0.1 \
       --activation-fn gelu --dropout 0.1  --max-tokens 8192 --glat-mode glat \
       --length-loss-factor 0.1 --pred-length-offset 
    

    I got this error:

    train.py: error: argument --arch/-a: invalid choice: 'glat_sd' (choose from 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de', 'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big', 'transformer_wmt_en_de_big_t2t', 'multilingual_transformer', 'multilingual_transformer_iwslt_de_en', 'transformer_lm', 'transformer_lm_big', 'transformer_lm_baevski_wiki103', 'transformer_lm_wiki103', 'transformer_lm_baevski_gbw', 'transformer_lm_gbw', 'transformer_lm_gpt', 'transformer_lm_gpt2_small', 'transformer_lm_gpt2_medium', 'transformer_lm_gpt2_big', 'lightconv', 'lightconv_iwslt_de_en', 'lightconv_wmt_en_de', 'lightconv_wmt_en_de_big', 'lightconv_wmt_en_fr_big', 'lightconv_wmt_zh_en_big', 'lightconv_lm', 'lightconv_lm_gbw', 'nat', 'nonautoregressive_transformer_wmt_en_de', 'nat_12d', 'nat_24d', 'nacrf_transformer', 'iterative_nonautoregressive_transformer', 'iterative_nonautoregressive_transformer_wmt_en_de', 'cmlm_transformer', 'cmlm_transformer_wmt_en_de', 'levenshtein_transformer', 'levenshtein_transformer_wmt_en_de', 'levenshtein_transformer_vaswani_wmt_en_de_big', 'levenshtein_transformer_wmt_en_de_big', 'insertion_transformer', 'nat_glat', 'glat_base', 'glat_big', 'glat_16e6d', 'nat_sd_shared', 'nat_sd', 'nat_ctc_sd', 'nat_ctc_cross_layer_hidden_replace_deep_sup', 'nat_ctc_sd_12d', 'nat_ctc_sd_de_24d', 'nat_ctc_s', 'nat_ctc_d', 'nat_sd_glat_base', 'nat_sd_glat', 'nat_sd_glat_12d', 'nat_sd_glat_24d', 'nat_sd_glat_12e', 'glat_s', 'glat_d', 'nat_s', 'nat_s_12d', 'nat_s_24d', 'nat_d', 'nat_d_12d', 'nat_d_24d', 'nat_sd_glat_anneal', 'nat_sd_glat_anneal_12d', 'nat_sd_glat_anneal_24d', 'nat_sd_glat_anneal_12e', 'nat_ctc', 'nat_ctc_fixlen', 'nat_ctc_refine', 'ctc_from_zaixiang', 'cmlm_sd', 'nat_cf', 'nat_md', 'nat_sd_ss', 'nat_sd_glat_ss', 'nat_ctc_sd_ss', 'cmlm_sd_ss', 'transformer_align', 'transformer_wmt_en_de_big_align', 'lstm', 'lstm_wiseman_iwslt_de_en', 'lstm_luong_wmt_en_de', 'lstm_lm', 's2t_berard', 's2t_berard_256_3_3', 's2t_berard_512_3_2', 's2t_berard_512_5_3', 's2t_transformer', 's2t_transformer_s', 's2t_transformer_sp', 's2t_transformer_m', 's2t_transformer_mp', 's2t_transformer_l', 's2t_transformer_lp', 'fconv', 'fconv_iwslt_de_en', 'fconv_wmt_en_ro', 'fconv_wmt_en_de', 'fconv_wmt_en_fr', 'roberta', 'roberta_base', 'roberta_large', 'xlm', 'masked_lm', 'bert_base', 'bert_large', 'xlm_base', 'wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'fconv_self_att', 'fconv_self_att_wp', 'fconv_lm', 'fconv_lm_dauphin_wikitext103', 'fconv_lm_dauphin_gbw', 'transformer_from_pretrained_xlm', 'hf_gpt2', 'hf_gpt2_medium', 'hf_gpt2_large', 'hf_gpt2_xl', 'bart_large', 'bart_base', 'mbart_large', 'mbart_base', 'mbart_base_wmt20', 'dummy_model', 'transformer_lm_megatron', 'transformer_lm_megatron_11b', 'transformer_iwslt_de_en_pipeline_parallel', 'transformer_wmt_en_de_big_pipeline_parallel', 'model_parallel_roberta', 'model_parallel_roberta_base', 'model_parallel_roberta_large')
    

    I found that glat_sd doesn't exist in the options. Why is this? By the way, thank you for the previous response; I have achieved ~27 BLEU.

    opened by Rexbalaeniceps 2
  • about inference


    Hi, thank you for releasing the code! I have a question about the different pipelines for training and inference. The paper says that at inference time, the prediction from each decoder layer is fed to the next layer. But at training time, embedding, concatenation, and linear operations are needed to generate the new tensor that is fed to the next layer. What would be the impact of this operation?

    opened by piaohe20221128 0
  • OOM problem with the model nat_ctc_sd_ss


    I trained the model "nat_ctc_sd_ss" with the command in the README.md on Tesla V100 GPUs, but I got an out-of-memory error. Is there anything that should be changed? My training command:

    python3 train.py $DATA --source-lang en --target-lang de  --save-dir checkpoints/NAT_CTC_DSLP_MT  --eval-tokenized-bleu \
       --keep-interval-updates 5 --save-interval-updates 500 --validate-interval-updates 500 --maximize-best-checkpoint-metric \
       --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --log-format simple --log-interval 100 \
       --eval-bleu --eval-bleu-detok space --keep-last-epochs 5 --keep-best-checkpoints 5  --fixed-validation-seed 7 --ddp-backend=no_c10d \
       --share-all-embeddings --decoder-learned-pos --encoder-learned-pos  --optimizer adam --adam-betas "(0.9,0.98)" --lr 0.0005 \
       --lr-scheduler inverse_sqrt --stop-min-lr 1e-09 --warmup-updates 10000 --warmup-init-lr 1e-07 --apply-bert-init --weight-decay 0.01 \
       --fp16 --clip-norm 2.0 --max-update 300000  --task translation_lev --criterion nat_loss --arch nat_ctc_sd_ss --noise full_mask \
       --src-upsample-scale 2 --use-ctc-decoder --ctc-beam-size 1  --concat-yhat --concat-dropout 0.0  --label-smoothing 0.0 \
       --activation-fn gelu --dropout 0.1  --max-tokens 8192 --ss-ratio 0.3 --fixed-ss-ratio
    

    My training log is as follows:

    2022-09-07 20:31:12 | INFO | fairseq_cli.train | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': 'simple', 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': True, 'suppress_crashes': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:15846', 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'no_c10d', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'distributed_wrapper': 'DDP', 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'tpu': False, 'distributed_num_procs': 8}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': 8192, 'batch_size': None, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 500, 'validate_after_updates': 0, 'fixed_validation_seed': 7, 'disable_validation': False, 'max_tokens_valid': 8192, 'batch_size_valid': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 300000, 'stop_time_hours': 0.0, 'clip_norm': 2.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0005], 'stop_min_lr': 1e-09, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints/NAT_CTC_DSLP_MT', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 500, 'keep_interval_updates': 5, 'keep_last_epochs': 5, 'keep_best_checkpoints': 5, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'bleu', 'maximize_best_checkpoint_metric': True, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'model_parallel_size': 1, 'distributed_rank': 0}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 8}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 
'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False, 'force_no_target': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='nat_ctc_sd_ss', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, 
localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'task': Namespace(_name='translation_lev', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, 
decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, 
threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'criterion': Namespace(_name='nat_loss', activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, all_gather_list_size=16384, all_layer_drop=False, apply_bert_init=True, arch='nat_ctc_sd_ss', attention_dropout=0.0, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='bleu', bf16=False, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=2.0, concat_dropout=0.0, concat_yhat=True, copy_src_token=False, cpu=False, criterion='nat_loss', cross_self_attention=False, ctc_beam_size=1, ctc_beam_size_train=1, curriculum=0, data='../Enrich_Syn_NAT/data/wmt14_ende_distill/bin', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layerdrop=0, decoder_layers=6, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=8, distributed_wrapper='DDP', dropout=0.1, dropout_anneal=False, dropout_anneal_end_ratio=0, empty_cache_freq=0, encoder_attention_heads=8, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_layerdrop=0, encoder_layers=6, encoder_layers_to_keep=None, encoder_learned_pos=True, encoder_normalize_before=False, eos=2, eval_bleu=True, eval_bleu_args=None, eval_bleu_detok='space', eval_bleu_detok_args=None, eval_bleu_print_samples=True, eval_bleu_remove_bpe='@@ ', eval_tokenized_bleu=True, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_ss_ratio=True, fixed_validation_seed=7, force_detach=False, force_ls=False, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', heartbeat_timeout=-1, inference_decoder_layer=-1, keep_best_checkpoints=5, keep_interval_updates=5, keep_last_epochs=5, label_smoothing=0.0, layer_drop_ratio=0.0, left_pad_source='True', left_pad_target='False', length_loss_factor=0.1, load_alignments=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_format='simple', log_interval=100, lr=[0.0005], lr_scheduler='inverse_sqrt', masked_loss=False, max_epoch=0, max_source_positions=1024, max_target_positions=1024, max_tokens=8192, max_tokens_valid=8192, max_update=300000, maximize_best_checkpoint_metric=True, memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_cross_attention=False, no_empty=False, no_epoch_checkpoints=False, no_last_checkpoints=False, no_progress_bar=False, no_save=False, no_save_optimizer_state=False, no_seed_provided=False, no_token_positional_embeddings=False, noise='full_mask', nprocs_per_node=8, 
num_batch_buckets=0, num_cross_layer_sample=0, num_shards=1, num_topk=1, num_workers=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plain_ctc=False, pred_length_offset=False, profile=False, quant_noise_pq=0, quant_noise_pq_block_size=8, quant_noise_scalar=0, quantization_config_path=None, repeat_layer=0, required_batch_size_multiple=8, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=True, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_option='hard', save_dir='checkpoints/NAT_CTC_DSLP_MT', save_interval=1, save_interval_updates=500, scoring='bleu', seed=1, sentence_avg=False, sg_length_pred=False, shard_id=0, share_all_embeddings=True, share_attn=False, share_decoder_input_output_embed=False, share_ffn=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, softcopy=False, softcopy_temp=5, softmax_temp=1, source_lang='en', src_embedding_copy=False, src_upsample_scale=2, ss_ratio=0.3, stop_min_lr=1e-09, stop_time_hours=0, suppress_crashes=False, target_lang='de', task='translation_lev', temp_anneal=False, tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tpu=False, train_subset='train', truncate_source=False, unk=3, update_freq=[1], upsample_primary=1, use_bmuf=False, use_ctc_decoder=True, use_old_adam=False, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=500, wandb_project=None, warmup_init_lr=1e-07, warmup_updates=10000, weight_decay=0.01, yhat_posemb=False, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9,0.98)', 'adam_eps': 1e-08, 'weight_decay': 0.01, 'use_old_adam': False, 'tpu': False, 'lr': [0.0005]}, 'lr_scheduler': {'_name': 'inverse_sqrt', 'warmup_updates': 10000, 'warmup_init_lr': 1e-07, 'lr': [0.0005]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | [en] dictionary: 39842 types
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | [de] dictionary: 39842 types
    2022-09-07 20:31:12 | INFO | fairseq.data.data_utils | loaded 3,000 examples from: ../data/wmt14_ende_distill/bin/valid.en-de.en
    2022-09-07 20:31:12 | INFO | fairseq.data.data_utils | loaded 3,000 examples from: ../data/wmt14_ende_distill/bin/valid.en-de.de
    2022-09-07 20:31:12 | INFO | fairseq.tasks.translation | ../data/wmt14_ende_distill/bin valid en-de 3000 examples
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | NATransformerModel(
      (encoder): FairseqNATEncoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(39842, 512, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(1026, 512, padding_idx=1)
        (layers): ModuleList(
          (0): TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (dropout_module): FairseqDropout()
            (activation_dropout_module): FairseqDropout()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          [... layers (1)-(5): TransformerEncoderLayer blocks identical to (0), elided for brevity ...]
        )
      )
      (decoder): NATransformerDecoder(
        (dropout_module): FairseqDropout()
        (embed_tokens): Embedding(39842, 512, padding_idx=1)
        (embed_positions): LearnedPositionalEmbedding(1026, 512, padding_idx=1)
        (layers): ModuleList(
          (0): TransformerSharedDecoderLayer(
            (dropout_module): FairseqDropout()
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (activation_dropout_module): FairseqDropout()
            (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (encoder_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=512, out_features=512, bias=True)
              (v_proj): Linear(in_features=512, out_features=512, bias=True)
              (q_proj): Linear(in_features=512, out_features=512, bias=True)
              (out_proj): Linear(in_features=512, out_features=512, bias=True)
            )
            (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
            (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          )
          [... layers (1)-(5): TransformerSharedDecoderLayer blocks identical to (0), elided for brevity ...]
        )
        (output_projection): Linear(in_features=512, out_features=39842, bias=False)
        (embed_length): Embedding(256, 512)
        (reduce_concat): ModuleList(
          (0): Linear(in_features=1024, out_features=512, bias=False)
          (1): Linear(in_features=1024, out_features=512, bias=False)
          (2): Linear(in_features=1024, out_features=512, bias=False)
          (3): Linear(in_features=1024, out_features=512, bias=False)
          (4): Linear(in_features=1024, out_features=512, bias=False)
        )
      )
    )
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | task: TranslationLevenshteinTask
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | model: NATransformerModel
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | criterion: LabelSmoothedDualImitationCriterion
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | num. model params: 68,340,736 (num. trained: 68,340,736)
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.0.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.1.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.2.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.3.bias
    2022-09-07 20:31:14 | INFO | fairseq.trainer | detected shared parameter: decoder.output_projection.bias <- decoder.reduce_concat.4.bias
    2022-09-07 20:31:14 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:2 to store for rank: 0
    2022-09-07 20:31:14 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 8 nodes.
    2022-09-07 20:31:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 8 workers***********************
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   0: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   1: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   2: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   3: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   4: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   5: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   6: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | rank   7: capabilities =  7.0  ; total memory = 31.749 GB ; name = Tesla V100-SXM2-32GB                    
    2022-09-07 20:31:14 | INFO | fairseq.utils | ***********************CUDA enviroments for all 8 workers***********************
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | training on 8 devices (GPUs/TPUs)
    2022-09-07 20:31:14 | INFO | fairseq_cli.train | max tokens per GPU = 8192 and batch size per GPU = None
    2022-09-07 20:31:14 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/NAT_CTC_DSLP_MT/checkpoint_last.pt
    2022-09-07 20:31:14 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/NAT_CTC_DSLP_MT/checkpoint_last.pt
    2022-09-07 20:31:14 | INFO | fairseq.trainer | loading train data for epoch 1
    2022-09-07 20:31:15 | INFO | fairseq.data.data_utils | loaded 3,961,179 examples from: ../data/wmt14_ende_distill/bin/train.en-de.en
    2022-09-07 20:31:15 | INFO | fairseq.data.data_utils | loaded 3,961,179 examples from: ../data/wmt14_ende_distill/bin/train.en-de.de
    2022-09-07 20:31:15 | INFO | fairseq.tasks.translation | ../data/wmt14_ende_distill/bin train en-de 3961179 examples
    2022-09-07 20:31:17 | INFO | fairseq.tasks.translation_lev | Dataset original size: 3961179, filtered size: 3961117
    2022-09-07 20:31:18 | INFO | fairseq.trainer | begin training epoch 1
    2022-09-07 20:31:23 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 7; 31.75 GiB total capacity; 28.04 GiB already allocated; 1.11 GiB free; 28.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    2022-09-07 20:31:23 | WARNING | fairseq.trainer | |===========================================================================|
    |                  PyTorch CUDA memory summary, device ID 0                 |
    |---------------------------------------------------------------------------|
    |            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
    |===========================================================================|
    |        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
    |---------------------------------------------------------------------------|
    | Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
    |---------------------------------------------------------------------------|
    | Allocations           |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Active allocs         |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | GPU reserved segments |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Non-releasable allocs |       0    |       0    |       0    |       0    |
    |       from large pool |       0    |       0    |       0    |       0    |
    |       from small pool |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize allocations  |       0    |       0    |       0    |       0    |
    |---------------------------------------------------------------------------|
    | Oversize GPU segments |       0    |       0    |       0    |       0    |
    |===========================================================================|
    
    2022-09-07 20:31:23 | WARNING | fairseq.trainer | (The same all-zero memory summary is printed again for device ID 1; it is omitted here because it is identical to the device 0 table above.)
    opened by YudiZh 0
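    If this OOM occurs under the replication settings, a common fairseq workaround (general practice, not something this repo prescribes) is to lower the per-GPU token budget and compensate with gradient accumulation, so the effective batch size stays at 8 GPUs x 8192 tokens; the allocator hint from the error message can also be applied. A minimal sketch, with an arbitrary example value for max_split_size_mb:

        # Allocator hint suggested by the OOM message (128 is an arbitrary example value):
        export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
        # Halve --max-tokens and accumulate gradients over 2 updates so the
        # effective batch size is unchanged (both are standard fairseq flags):
        python3 train.py data-bin/wmt14.en-de_kd ... --max-tokens 4096 --update-freq 2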
  • No glat_sd arch

    Hi Chengyang, thanks for your great code! I'm trying to reproduce the GLAT+DSLP model. I checked the provided training scripts, but there is no "--arch glat_sd" model registered in the code; should it be "nat_sd_glat"? Also, what do "ss" and "sd" stand for? Does "sd" mean "deeply supervised", and what about "ss"? Thanks for your answer! (A quick way to list the registered names is sketched after this item.)

    opened by bbo0924 2
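    On the naming question above: a general way to list the architecture names a fairseq fork actually registers (a fairseq introspection trick, not repo documentation) is to print fairseq's architecture registry from the repository root, so that the local fork with its custom models is the one imported:

        # Print registered architecture names containing "sd":
        python3 -c "from fairseq.models import ARCH_MODEL_REGISTRY; print(sorted(a for a in ARCH_MODEL_REGISTRY if 'sd' in a))"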
  • The shape of probs_seq does not match with the shape of the vocabulary Segmentation fault (core dumped)

    [/home/nihao/nihao-users2/yuhao/DSLP/env/ctcdecode/ctcdecode/src/ctc_beam_search_decoder.cpp:32] FATAL: "(probs_seq[i].size()) == (vocabulary.size())" check failed. The shape of probs_seq does not match with the shape of the vocabulary (this check failure is printed three times)
    Segmentation fault (core dumped)

    I encountered this problem without modifying the original code. What could be the cause? (A generic sanity check is sketched after this item.)

    opened by thunder123321 6
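    For mismatches like this, a generic sanity check (assuming the preprocessed data sits in data-bin/wmt14.en-de_kd, as in the replication scripts) is to compare the fairseq target dictionary size against the vocabulary handed to ctcdecode; the last dimension of the probability tensor must match that vocabulary size:

        # Print the target dictionary size; the probs passed to the CTC beam
        # search must have exactly this vocabulary dimension:
        python3 -c "from fairseq.data import Dictionary; print(len(Dictionary.load('data-bin/wmt14.en-de_kd/dict.de.txt')))"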
  • ctcdecode install error

    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1673, in _run_ninja_build
        env=env)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/subprocess.py", line 512, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "setup.py", line 55, in <module>
        cmdclass={'build_ext': BuildExtension}
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 708, in build_extensions
        build_ext.build_extensions(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
        _build_ext.build_ext.build_extensions(self)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
        _build_ext.build_extension(self, ext)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
        depends=ext.depends)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 538, in unix_wrap_ninja_compile
        with_cuda=with_cuda)
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1359, in _write_ninja_file_and_compile_objects
        error_prefix='Error compiling objects for extension')
      File "/home/nihao/anaconda3/envs/DSLP/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension

    Environment: torch 1.8, CUDA 11.1, gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0. Is my torch version too high? (A few generic debugging steps are sketched after this item.)

    opened by thunder123321 4
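    When the ninja build fails like this, the quoted traceback hides the actual compiler error. A few generic checks (not repo-specific) usually surface it:

        # Confirm the torch / CUDA combination the extension will compile against:
        python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
        nvcc --version
        # Rebuild with verbose output so the first real g++/nvcc error is visible
        # (directory as in the log above):
        cd env/ctcdecode && pip install -v .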
Owner
Chenyang Huang
Stay hungry, stay foolish