According to the README, training on 4 GPUs reaches a BLEU of 28.35, and 28.67 when trained for more epochs:
GPU count | Mixed precision BLEU | fp32 BLEU | Mixed precision training time | fp32 training time
-- | -- | -- | -- | --
8 | 28.69 | 28.43 | 446 min | 1896 min
4 | 28.35 | 28.31 | 834 min | 3733 min

GPU count | Precision | BLEU score | Epochs to train | Training time
-- | -- | -- | -- | --
4 | fp16 | 28.67 | 74 | 1925 min
4 | fp32 | 28.40 | 47 | 5478 min
However, I ran the unmodified code on 4 GPUs, and the best result I got from "checkpoint_best.pt" (epoch 19 in my case) is 27.63. I trained for 80 epochs in total; the best test BLEU over all of them is 28.19 (epoch 37 in the log below), but validation did not select that epoch as "checkpoint_best.pt".
I used the following command line to train the model:
nohup python -m torch.distributed.launch --nproc_per_node 4 /workspace/translation/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
  --arch transformer_wmt_en_de_big_t2t \
  --share-all-embeddings \
  --optimizer adam \
  --adam-betas '(0.9, 0.997)' \
  --adam-eps "1e-9" \
  --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt \
  --warmup-init-lr 0.0 \
  --update-freq 2 \
  --warmup-updates 8000 \
  --lr 0.0006 \
  --min-lr 0.0 \
  --dropout 0.1 \
  --weight-decay 0.0 \
  --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 \
  --max-tokens 5120 \
  --seed 1 \
  --max-epoch 80 \
  --ignore-case \
  --fp16 \
  --save-dir /workspace/checkpoints \
  --distributed-init-method env:// > train.nohup.out &
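
For reference, the effective batch size implied by this command can be estimated as below. This is a rough sketch: it assumes `--max-tokens` is a per-GPU cap and that `--update-freq` accumulates gradients over that many batches before each optimizer step. The helper name `tokens_per_update` is hypothetical and not part of the training code.

```python
def tokens_per_update(num_gpus: int, max_tokens: int, update_freq: int) -> int:
    """Upper bound on the number of tokens contributing to one optimizer step.

    Assumes max_tokens is enforced per GPU and gradients are accumulated
    over update_freq batches before the optimizer steps.
    """
    return num_gpus * max_tokens * update_freq


# Values taken from the command above: 4 GPUs, --max-tokens 5120, --update-freq 2.
batch = tokens_per_update(num_gpus=4, max_tokens=5120, update_freq=2)
print(batch)  # 40960
```

If the README numbers were produced with a different number of tokens per update, that alone could change the learning-rate setting that works best, so this figure is worth comparing.
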
I also tried different `--warmup-updates` and `--lr` values, and the results were similar. The test results for each checkpoint are:
Test Checkpoint1
| Translated 3003 sentences (84994 tokens) in 25.2s (119.35 sentences/s, 3377.84 tokens/s)
| Generate test with beam=4: BLEU4 = 18.11, 50.2/23.5/12.7/7.2 (BP=1.000, ratio=1.041, syslen=67147, reflen=64512)
Test Checkpoint2
| Translated 3003 sentences (87704 tokens) in 27.5s (109.17 sentences/s, 3188.43 tokens/s)
| Generate test with beam=4: BLEU4 = 21.26, 52.5/26.7/15.5/9.4 (BP=1.000, ratio=1.061, syslen=68450, reflen=64512)
Test Checkpoint3
| Translated 3003 sentences (86611 tokens) in 25.8s (116.61 sentences/s, 3363.17 tokens/s)
| Generate test with beam=4: BLEU4 = 23.91, 55.5/29.5/17.8/11.2 (BP=1.000, ratio=1.040, syslen=67079, reflen=64512)
Test Checkpoint4
| Translated 3003 sentences (86518 tokens) in 25.8s (116.61 sentences/s, 3359.54 tokens/s)
| Generate test with beam=4: BLEU4 = 25.26, 56.7/30.9/19.0/12.3 (BP=1.000, ratio=1.035, syslen=66758, reflen=64512)
Test Checkpoint5
| Translated 3003 sentences (86768 tokens) in 25.7s (116.96 sentences/s, 3379.47 tokens/s)
| Generate test with beam=4: BLEU4 = 25.63, 56.8/31.2/19.4/12.5 (BP=1.000, ratio=1.034, syslen=66698, reflen=64512)
Test Checkpoint6
| Translated 3003 sentences (87220 tokens) in 25.8s (116.21 sentences/s, 3375.30 tokens/s)
| Generate test with beam=4: BLEU4 = 25.98, 56.9/31.5/19.8/12.9 (BP=1.000, ratio=1.042, syslen=67205, reflen=64512)
Test Checkpoint7
| Translated 3003 sentences (87715 tokens) in 25.9s (115.80 sentences/s, 3382.54 tokens/s)
| Generate test with beam=4: BLEU4 = 26.24, 57.2/31.8/20.0/13.0 (BP=1.000, ratio=1.045, syslen=67413, reflen=64512)
Test Checkpoint8
| Translated 3003 sentences (87808 tokens) in 26.8s (111.88 sentences/s, 3271.39 tokens/s)
| Generate test with beam=4: BLEU4 = 26.82, 57.6/32.3/20.5/13.6 (BP=1.000, ratio=1.045, syslen=67444, reflen=64512)
Test Checkpoint9
| Translated 3003 sentences (87394 tokens) in 25.6s (117.26 sentences/s, 3412.38 tokens/s)
| Generate test with beam=4: BLEU4 = 26.63, 57.8/32.2/20.3/13.3 (BP=1.000, ratio=1.039, syslen=67033, reflen=64512)
Test Checkpoint10
| Translated 3003 sentences (86825 tokens) in 25.8s (116.31 sentences/s, 3362.82 tokens/s)
| Generate test with beam=4: BLEU4 = 27.10, 58.1/32.7/20.7/13.7 (BP=1.000, ratio=1.031, syslen=66541, reflen=64512)
Test Checkpoint11
| Translated 3003 sentences (86850 tokens) in 25.9s (116.11 sentences/s, 3358.03 tokens/s)
| Generate test with beam=4: BLEU4 = 27.29, 58.1/32.8/20.9/13.9 (BP=1.000, ratio=1.032, syslen=66563, reflen=64512)
Test Checkpoint12
| Translated 3003 sentences (87137 tokens) in 26.2s (114.74 sentences/s, 3329.31 tokens/s)
| Generate test with beam=4: BLEU4 = 27.28, 58.2/32.9/20.9/13.8 (BP=1.000, ratio=1.035, syslen=66787, reflen=64512)
Test Checkpoint13
| Translated 3003 sentences (86810 tokens) in 25.6s (117.41 sentences/s, 3393.98 tokens/s)
| Generate test with beam=4: BLEU4 = 27.26, 58.3/32.9/20.9/13.8 (BP=1.000, ratio=1.031, syslen=66500, reflen=64512)
Test Checkpoint14
| Translated 3003 sentences (87359 tokens) in 25.8s (116.30 sentences/s, 3383.15 tokens/s)
| Generate test with beam=4: BLEU4 = 27.69, 58.3/33.2/21.3/14.3 (BP=1.000, ratio=1.036, syslen=66830, reflen=64512)
Test Checkpoint15
| Translated 3003 sentences (87415 tokens) in 26.3s (114.33 sentences/s, 3327.98 tokens/s)
| Generate test with beam=4: BLEU4 = 27.37, 58.1/32.9/21.0/14.0 (BP=1.000, ratio=1.038, syslen=66951, reflen=64512)
Test Checkpoint16
| Translated 3003 sentences (87332 tokens) in 26.7s (112.51 sentences/s, 3272.10 tokens/s)
| Generate test with beam=4: BLEU4 = 27.33, 58.1/32.9/21.0/13.9 (BP=1.000, ratio=1.039, syslen=66998, reflen=64512)
Test Checkpoint17
| Translated 3003 sentences (86721 tokens) in 25.9s (116.06 sentences/s, 3351.62 tokens/s)
| Generate test with beam=4: BLEU4 = 27.32, 58.4/33.0/20.9/13.8 (BP=1.000, ratio=1.029, syslen=66385, reflen=64512)
Test Checkpoint18
| Translated 3003 sentences (87388 tokens) in 26.2s (114.71 sentences/s, 3338.08 tokens/s)
| Generate test with beam=4: BLEU4 = 27.57, 58.3/33.1/21.2/14.2 (BP=1.000, ratio=1.038, syslen=66956, reflen=64512)
Test Checkpoint19
| Translated 3003 sentences (86919 tokens) in 25.8s (116.28 sentences/s, 3365.50 tokens/s)
| Generate test with beam=4: BLEU4 = 27.63, 58.6/33.3/21.2/14.1 (BP=1.000, ratio=1.033, syslen=66642, reflen=64512)
Test Checkpoint20
| Translated 3003 sentences (87485 tokens) in 26.1s (115.24 sentences/s, 3357.16 tokens/s)
| Generate test with beam=4: BLEU4 = 27.48, 58.1/33.0/21.1/14.1 (BP=1.000, ratio=1.037, syslen=66924, reflen=64512)
Test Checkpoint21
| Translated 3003 sentences (86993 tokens) in 26.3s (114.07 sentences/s, 3304.46 tokens/s)
| Generate test with beam=4: BLEU4 = 27.77, 58.5/33.3/21.4/14.3 (BP=1.000, ratio=1.032, syslen=66564, reflen=64512)
Test Checkpoint22
| Translated 3003 sentences (87084 tokens) in 25.4s (118.07 sentences/s, 3424.04 tokens/s)
| Generate test with beam=4: BLEU4 = 27.87, 58.6/33.3/21.5/14.4 (BP=1.000, ratio=1.032, syslen=66595, reflen=64512)
Test Checkpoint23
| Translated 3003 sentences (87013 tokens) in 26.4s (113.92 sentences/s, 3300.98 tokens/s)
| Generate test with beam=4: BLEU4 = 27.59, 58.4/33.2/21.2/14.1 (BP=1.000, ratio=1.033, syslen=66626, reflen=64512)
Test Checkpoint24
| Translated 3003 sentences (86741 tokens) in 26.0s (115.49 sentences/s, 3335.84 tokens/s)
| Generate test with beam=4: BLEU4 = 27.98, 58.7/33.5/21.6/14.4 (BP=1.000, ratio=1.029, syslen=66379, reflen=64512)
Test Checkpoint25
| Translated 3003 sentences (86884 tokens) in 25.4s (118.05 sentences/s, 3415.42 tokens/s)
| Generate test with beam=4: BLEU4 = 27.94, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.029, syslen=66392, reflen=64512)
Test Checkpoint26
| Translated 3003 sentences (86840 tokens) in 26.4s (113.68 sentences/s, 3287.46 tokens/s)
| Generate test with beam=4: BLEU4 = 27.91, 58.7/33.5/21.5/14.4 (BP=1.000, ratio=1.028, syslen=66344, reflen=64512)
Test Checkpoint27
| Translated 3003 sentences (87050 tokens) in 26.2s (114.45 sentences/s, 3317.73 tokens/s)
| Generate test with beam=4: BLEU4 = 27.88, 58.7/33.4/21.5/14.3 (BP=1.000, ratio=1.030, syslen=66451, reflen=64512)
Test Checkpoint28
| Translated 3003 sentences (86981 tokens) in 25.8s (116.40 sentences/s, 3371.53 tokens/s)
| Generate test with beam=4: BLEU4 = 27.80, 58.7/33.3/21.4/14.3 (BP=1.000, ratio=1.031, syslen=66488, reflen=64512)
Test Checkpoint29
| Translated 3003 sentences (86219 tokens) in 25.6s (117.33 sentences/s, 3368.59 tokens/s)
| Generate test with beam=4: BLEU4 = 27.82, 58.8/33.4/21.4/14.3 (BP=1.000, ratio=1.022, syslen=65941, reflen=64512)
Test Checkpoint30
| Translated 3003 sentences (86879 tokens) in 26.9s (111.61 sentences/s, 3229.04 tokens/s)
| Generate test with beam=4: BLEU4 = 27.88, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.031, syslen=66501, reflen=64512)
Test Checkpoint31
| Translated 3003 sentences (87082 tokens) in 26.6s (112.83 sentences/s, 3271.95 tokens/s)
| Generate test with beam=4: BLEU4 = 28.00, 58.8/33.6/21.6/14.4 (BP=1.000, ratio=1.032, syslen=66570, reflen=64512)
Test Checkpoint32
| Translated 3003 sentences (86677 tokens) in 26.6s (112.93 sentences/s, 3259.43 tokens/s)
| Generate test with beam=4: BLEU4 = 27.98, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.028, syslen=66289, reflen=64512)
Test Checkpoint33
| Translated 3003 sentences (87034 tokens) in 26.2s (114.54 sentences/s, 3319.61 tokens/s)
| Generate test with beam=4: BLEU4 = 28.10, 58.8/33.6/21.7/14.5 (BP=1.000, ratio=1.032, syslen=66553, reflen=64512)
Test Checkpoint34
| Translated 3003 sentences (87064 tokens) in 26.3s (114.28 sentences/s, 3313.16 tokens/s)
| Generate test with beam=4: BLEU4 = 27.92, 58.4/33.3/21.6/14.4 (BP=1.000, ratio=1.031, syslen=66534, reflen=64512)
Test Checkpoint35
| Translated 3003 sentences (86818 tokens) in 26.6s (112.86 sentences/s, 3262.78 tokens/s)
| Generate test with beam=4: BLEU4 = 28.11, 58.9/33.7/21.7/14.5 (BP=1.000, ratio=1.028, syslen=66336, reflen=64512)
Test Checkpoint36
| Translated 3003 sentences (87037 tokens) in 25.9s (115.89 sentences/s, 3358.98 tokens/s)
| Generate test with beam=4: BLEU4 = 28.18, 58.8/33.6/21.8/14.6 (BP=1.000, ratio=1.031, syslen=66483, reflen=64512)
Test Checkpoint37
| Translated 3003 sentences (86740 tokens) in 25.7s (116.91 sentences/s, 3376.92 tokens/s)
| Generate test with beam=4: BLEU4 = 28.19, 58.9/33.7/21.8/14.6 (BP=1.000, ratio=1.026, syslen=66197, reflen=64512)
Test Checkpoint38
| Translated 3003 sentences (87084 tokens) in 26.1s (115.05 sentences/s, 3336.24 tokens/s)
| Generate test with beam=4: BLEU4 = 28.01, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.032, syslen=66551, reflen=64512)
Test Checkpoint39
| Translated 3003 sentences (86972 tokens) in 27.7s (108.47 sentences/s, 3141.58 tokens/s)
| Generate test with beam=4: BLEU4 = 28.10, 58.7/33.5/21.7/14.6 (BP=1.000, ratio=1.030, syslen=66456, reflen=64512)
Test Checkpoint40
| Translated 3003 sentences (86717 tokens) in 25.7s (116.94 sentences/s, 3376.78 tokens/s)
| Generate test with beam=4: BLEU4 = 27.81, 58.7/33.4/21.4/14.2 (BP=1.000, ratio=1.028, syslen=66314, reflen=64512)
Test Checkpoint41
| Translated 3003 sentences (86542 tokens) in 26.0s (115.52 sentences/s, 3329.06 tokens/s)
| Generate test with beam=4: BLEU4 = 27.69, 58.9/33.3/21.3/14.1 (BP=1.000, ratio=1.025, syslen=66127, reflen=64512)
Test Checkpoint42
| Translated 3003 sentences (86841 tokens) in 27.1s (110.96 sentences/s, 3208.64 tokens/s)
| Generate test with beam=4: BLEU4 = 27.99, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.028, syslen=66329, reflen=64512)
Test Checkpoint43
| Translated 3003 sentences (86986 tokens) in 26.8s (111.92 sentences/s, 3241.95 tokens/s)
| Generate test with beam=4: BLEU4 = 27.81, 58.6/33.3/21.4/14.3 (BP=1.000, ratio=1.031, syslen=66501, reflen=64512)
Test Checkpoint44
| Translated 3003 sentences (86691 tokens) in 25.6s (117.24 sentences/s, 3384.53 tokens/s)
| Generate test with beam=4: BLEU4 = 28.09, 58.8/33.6/21.7/14.6 (BP=1.000, ratio=1.026, syslen=66162, reflen=64512)
Test Checkpoint45
| Translated 3003 sentences (86845 tokens) in 26.5s (113.44 sentences/s, 3280.52 tokens/s)
| Generate test with beam=4: BLEU4 = 28.00, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.029, syslen=66353, reflen=64512)
Test Checkpoint46
| Translated 3003 sentences (86280 tokens) in 25.7s (116.75 sentences/s, 3354.46 tokens/s)
| Generate test with beam=4: BLEU4 = 28.13, 59.0/33.6/21.7/14.6 (BP=1.000, ratio=1.021, syslen=65860, reflen=64512)
Test Checkpoint47
| Translated 3003 sentences (86857 tokens) in 26.4s (113.64 sentences/s, 3286.92 tokens/s)
| Generate test with beam=4: BLEU4 = 27.77, 58.6/33.3/21.4/14.3 (BP=1.000, ratio=1.029, syslen=66402, reflen=64512)
Test Checkpoint48
| Translated 3003 sentences (87087 tokens) in 26.0s (115.65 sentences/s, 3353.93 tokens/s)
| Generate test with beam=4: BLEU4 = 27.68, 58.4/33.2/21.3/14.2 (BP=1.000, ratio=1.032, syslen=66576, reflen=64512)
Test Checkpoint49
| Translated 3003 sentences (86627 tokens) in 25.5s (117.97 sentences/s, 3402.95 tokens/s)
| Generate test with beam=4: BLEU4 = 28.02, 59.0/33.6/21.6/14.4 (BP=1.000, ratio=1.026, syslen=66208, reflen=64512)
Test Checkpoint50
| Translated 3003 sentences (86529 tokens) in 25.9s (116.09 sentences/s, 3345.07 tokens/s)
| Generate test with beam=4: BLEU4 = 27.96, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.024, syslen=66049, reflen=64512)
Test Checkpoint51
| Translated 3003 sentences (87095 tokens) in 26.2s (114.50 sentences/s, 3320.73 tokens/s)
| Generate test with beam=4: BLEU4 = 27.80, 58.6/33.4/21.4/14.3 (BP=1.000, ratio=1.030, syslen=66471, reflen=64512)
Test Checkpoint52
| Translated 3003 sentences (87160 tokens) in 27.2s (110.54 sentences/s, 3208.27 tokens/s)
| Generate test with beam=4: BLEU4 = 27.89, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.032, syslen=66559, reflen=64512)
Test Checkpoint53
| Translated 3003 sentences (86909 tokens) in 26.1s (114.96 sentences/s, 3326.93 tokens/s)
| Generate test with beam=4: BLEU4 = 27.90, 58.8/33.5/21.5/14.3 (BP=1.000, ratio=1.029, syslen=66353, reflen=64512)
Test Checkpoint54
| Translated 3003 sentences (86785 tokens) in 26.1s (114.94 sentences/s, 3321.61 tokens/s)
| Generate test with beam=4: BLEU4 = 28.05, 58.8/33.6/21.6/14.5 (BP=1.000, ratio=1.028, syslen=66308, reflen=64512)
Test Checkpoint55
| Translated 3003 sentences (86914 tokens) in 25.9s (115.95 sentences/s, 3355.82 tokens/s)
| Generate test with beam=4: BLEU4 = 27.76, 58.5/33.3/21.4/14.2 (BP=1.000, ratio=1.029, syslen=66376, reflen=64512)
Test Checkpoint56
| Translated 3003 sentences (86775 tokens) in 26.5s (113.27 sentences/s, 3273.16 tokens/s)
| Generate test with beam=4: BLEU4 = 27.75, 58.5/33.2/21.4/14.3 (BP=1.000, ratio=1.028, syslen=66314, reflen=64512)
Test Checkpoint57
| Translated 3003 sentences (86522 tokens) in 26.3s (114.39 sentences/s, 3295.88 tokens/s)
| Generate test with beam=4: BLEU4 = 27.91, 58.9/33.4/21.5/14.3 (BP=1.000, ratio=1.024, syslen=66052, reflen=64512)
Test Checkpoint58
| Translated 3003 sentences (86269 tokens) in 26.1s (114.94 sentences/s, 3301.85 tokens/s)
| Generate test with beam=4: BLEU4 = 27.77, 58.7/33.3/21.4/14.2 (BP=1.000, ratio=1.021, syslen=65893, reflen=64512)
Test Checkpoint59
| Translated 3003 sentences (86738 tokens) in 25.9s (115.78 sentences/s, 3344.27 tokens/s)
| Generate test with beam=4: BLEU4 = 27.96, 58.5/33.4/21.6/14.5 (BP=1.000, ratio=1.029, syslen=66378, reflen=64512)
Test Checkpoint60
| Translated 3003 sentences (86566 tokens) in 25.7s (116.92 sentences/s, 3370.48 tokens/s)
| Generate test with beam=4: BLEU4 = 27.85, 58.7/33.4/21.5/14.3 (BP=1.000, ratio=1.025, syslen=66151, reflen=64512)
Test Checkpoint61
| Translated 3003 sentences (86785 tokens) in 25.3s (118.91 sentences/s, 3436.47 tokens/s)
| Generate test with beam=4: BLEU4 = 27.74, 58.7/33.3/21.3/14.2 (BP=1.000, ratio=1.028, syslen=66291, reflen=64512)
Test Checkpoint62
| Translated 3003 sentences (86261 tokens) in 25.7s (116.79 sentences/s, 3354.79 tokens/s)
| Generate test with beam=4: BLEU4 = 27.86, 58.8/33.4/21.5/14.3 (BP=1.000, ratio=1.021, syslen=65898, reflen=64512)
Test Checkpoint63
| Translated 3003 sentences (86569 tokens) in 25.1s (119.58 sentences/s, 3447.32 tokens/s)
| Generate test with beam=4: BLEU4 = 27.92, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.025, syslen=66155, reflen=64512)
Test Checkpoint64
| Translated 3003 sentences (86583 tokens) in 25.8s (116.47 sentences/s, 3357.96 tokens/s)
| Generate test with beam=4: BLEU4 = 27.59, 58.5/33.2/21.2/14.1 (BP=1.000, ratio=1.025, syslen=66146, reflen=64512)
Test Checkpoint65
| Translated 3003 sentences (86707 tokens) in 26.2s (114.76 sentences/s, 3313.64 tokens/s)
| Generate test with beam=4: BLEU4 = 27.78, 58.5/33.3/21.4/14.2 (BP=1.000, ratio=1.028, syslen=66294, reflen=64512)
Test Checkpoint66
| Translated 3003 sentences (86478 tokens) in 26.0s (115.55 sentences/s, 3327.54 tokens/s)
| Generate test with beam=4: BLEU4 = 27.63, 58.5/33.2/21.3/14.1 (BP=1.000, ratio=1.025, syslen=66114, reflen=64512)
Test Checkpoint67
| Translated 3003 sentences (86564 tokens) in 25.8s (116.40 sentences/s, 3355.20 tokens/s)
| Generate test with beam=4: BLEU4 = 27.92, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.026, syslen=66200, reflen=64512)
Test Checkpoint68
| Translated 3003 sentences (86548 tokens) in 26.2s (114.58 sentences/s, 3302.20 tokens/s)
| Generate test with beam=4: BLEU4 = 28.08, 58.8/33.6/21.7/14.5 (BP=1.000, ratio=1.024, syslen=66041, reflen=64512)
Test Checkpoint69
| Translated 3003 sentences (86580 tokens) in 25.9s (116.08 sentences/s, 3346.72 tokens/s)
| Generate test with beam=4: BLEU4 = 28.13, 58.8/33.7/21.7/14.6 (BP=1.000, ratio=1.026, syslen=66178, reflen=64512)
Test Checkpoint70
| Translated 3003 sentences (86448 tokens) in 26.1s (115.01 sentences/s, 3310.94 tokens/s)
| Generate test with beam=4: BLEU4 = 27.88, 58.8/33.5/21.5/14.3 (BP=1.000, ratio=1.023, syslen=65998, reflen=64512)
Test Checkpoint71
| Translated 3003 sentences (86832 tokens) in 26.0s (115.69 sentences/s, 3345.26 tokens/s)
| Generate test with beam=4: BLEU4 = 27.91, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.029, syslen=66355, reflen=64512)
Test Checkpoint72
| Translated 3003 sentences (86550 tokens) in 25.6s (117.18 sentences/s, 3377.25 tokens/s)
| Generate test with beam=4: BLEU4 = 27.95, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.024, syslen=66092, reflen=64512)
Test Checkpoint73
| Translated 3003 sentences (86415 tokens) in 25.4s (118.17 sentences/s, 3400.41 tokens/s)
| Generate test with beam=4: BLEU4 = 27.84, 58.8/33.4/21.4/14.3 (BP=1.000, ratio=1.023, syslen=65990, reflen=64512)
Test Checkpoint74
| Translated 3003 sentences (86251 tokens) in 26.2s (114.65 sentences/s, 3292.82 tokens/s)
| Generate test with beam=4: BLEU4 = 27.97, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.021, syslen=65889, reflen=64512)
Test Checkpoint75
| Translated 3003 sentences (86418 tokens) in 26.1s (115.03 sentences/s, 3310.16 tokens/s)
| Generate test with beam=4: BLEU4 = 27.72, 58.6/33.2/21.3/14.2 (BP=1.000, ratio=1.023, syslen=65971, reflen=64512)
Test Checkpoint76
| Translated 3003 sentences (86474 tokens) in 25.9s (116.04 sentences/s, 3341.50 tokens/s)
| Generate test with beam=4: BLEU4 = 27.63, 58.6/33.2/21.2/14.1 (BP=1.000, ratio=1.023, syslen=66025, reflen=64512)
Test Checkpoint77
| Translated 3003 sentences (86100 tokens) in 25.6s (117.20 sentences/s, 3360.35 tokens/s)
| Generate test with beam=4: BLEU4 = 28.11, 59.1/33.7/21.7/14.5 (BP=1.000, ratio=1.018, syslen=65695, reflen=64512)
Test Checkpoint78
| Translated 3003 sentences (86497 tokens) in 26.2s (114.53 sentences/s, 3298.82 tokens/s)
| Generate test with beam=4: BLEU4 = 27.80, 58.7/33.4/21.4/14.3 (BP=1.000, ratio=1.024, syslen=66073, reflen=64512)
Test Checkpoint79
| Translated 3003 sentences (86905 tokens) in 26.3s (114.22 sentences/s, 3305.35 tokens/s)
| Generate test with beam=4: BLEU4 = 27.69, 58.5/33.2/21.3/14.2 (BP=1.000, ratio=1.028, syslen=66327, reflen=64512)
Test Checkpoint80
| Translated 3003 sentences (86654 tokens) in 26.3s (114.36 sentences/s, 3300.06 tokens/s)
| Generate test with beam=4: BLEU4 = 27.65, 58.5/33.2/21.3/14.1 (BP=1.000, ratio=1.026, syslen=66219, reflen=64512)
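
The best epoch can be picked out of the log above with a short script. This is only a sketch: it assumes the log keeps the exact `Test CheckpointN` / `BLEU4 = ...` layout shown above, and `best_checkpoint` is a hypothetical helper, not part of the repository.

```python
import re

CKPT_LINE = re.compile(r"Test Checkpoint(\d+)")
BLEU_LINE = re.compile(r"BLEU4 = (\d+\.\d+)")


def best_checkpoint(log_text: str):
    """Return (epoch, bleu) for the highest BLEU4 found in the log text."""
    results = []
    epoch = None
    for line in log_text.splitlines():
        match = CKPT_LINE.search(line)
        if match:
            epoch = int(match.group(1))  # remember which checkpoint follows
            continue
        match = BLEU_LINE.search(line)
        if match and epoch is not None:
            results.append((epoch, float(match.group(1))))
    # On ties, max() keeps the earliest epoch; raises ValueError on an empty log.
    return max(results, key=lambda item: item[1])


# Three entries copied from the log above.
sample = """Test Checkpoint36
| Generate test with beam=4: BLEU4 = 28.18, 58.8/33.6/21.8/14.6 (BP=1.000, ratio=1.031, syslen=66483, reflen=64512)
Test Checkpoint37
| Generate test with beam=4: BLEU4 = 28.19, 58.9/33.7/21.8/14.6 (BP=1.000, ratio=1.026, syslen=66197, reflen=64512)
Test Checkpoint38
| Generate test with beam=4: BLEU4 = 28.01, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.032, syslen=66551, reflen=64512)
"""
print(best_checkpoint(sample))  # (37, 28.19)
```
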
So why am I not able to reproduce the results reported in the README? Could you share the command line you use to train the transformer on 4 GPUs?
Another question: the "Attention Is All You Need" paper uses 0.1 as the initial learning rate, whereas 0.0006 is used here. Why is there such a large difference in learning rate?
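
To make the comparison concrete, here is a minimal sketch of the two schedules. `inverse_sqrt_lr` reflects my understanding of the `inverse_sqrt` scheduler (linear warmup to `--lr`, then decay proportional to the inverse square root of the update number) and is an assumption, not the repository's code. `paper_lr` follows the schedule formula given in the paper, with its base-model values `d_model=512` and `warmup_steps=4000` as defaults. Both function names are hypothetical.

```python
def inverse_sqrt_lr(step, lr=0.0006, warmup_updates=8000, warmup_init_lr=0.0):
    """Linear warmup from warmup_init_lr to lr, then lr * sqrt(warmup_updates / step)."""
    if step < warmup_updates:
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    return lr * (warmup_updates / step) ** 0.5


def paper_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Peak learning rate of each schedule (reached at the end of warmup).
print(f"{inverse_sqrt_lr(8000):.6f}")  # 0.000600
print(f"{paper_lr(4000):.6f}")         # 0.000699
```

Under these assumptions the peak rates have the same order of magnitude, so the gap may come from how the 0.1 value is applied rather than from the schedule itself.
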