Hi! Thank you for providing the code. I am having trouble reproducing the results from the paper.
When I continue training from the provided pre-trained model, the average accuracy stays low.
Here is the log:
DeepRL-Grounding-master$ python a3c_main.py --num-processes 16 --evaluate 0 --load saved/pretrained_model --difficulty easy
2 Loading model ... saved/pretrained_model
9 Loading model ... saved/pretrained_model
Loading model ... saved/pretrained_model
14 Loading model ... saved/pretrained_model
5 Loading model ... saved/pretrained_model
12 Loading model ... saved/pretrained_model
11 Loading model ... saved/pretrained_model
6 Loading model ... saved/pretrained_model
8 Loading model ... saved/pretrained_model
0 Loading model ... saved/pretrained_model
10 Loading model ... saved/pretrained_model
15 Loading model ... saved/pretrained_model
1 Loading model ... saved/pretrained_model
13 Loading model ... saved/pretrained_model
4 Loading model ... saved/pretrained_model
3 Loading model ... saved/pretrained_model
7 Loading model ... saved/pretrained_model
Time 00h 39m 33s, Avg Reward 0.392, Avg Accuracy 0.42, Avg Ep length 19.9, Best Reward 0.0
Time 01h 21m 19s, Avg Reward 0.26, Avg Accuracy 0.3, Avg Ep length 21.66, Best Reward 0.392
Time 02h 04m 29s, Avg Reward 0.324, Avg Accuracy 0.34, Avg Ep length 21.9, Best Reward 0.392
Time 02h 48m 33s, Avg Reward 0.308, Avg Accuracy 0.32, Avg Ep length 22.92, Best Reward 0.392
Time 03h 29m 33s, Avg Reward 0.376, Avg Accuracy 0.38, Avg Ep length 21.22, Best Reward 0.392
Time 04h 14m 27s, Avg Reward 0.288, Avg Accuracy 0.3, Avg Ep length 23.64, Best Reward 0.392
Time 04h 58m 49s, Avg Reward 0.228, Avg Accuracy 0.24, Avg Ep length 23.74, Best Reward 0.392
Time 05h 41m 42s, Avg Reward 0.276, Avg Accuracy 0.28, Avg Ep length 23.64, Best Reward 0.392
Time 06h 25m 22s, Avg Reward 0.336, Avg Accuracy 0.34, Avg Ep length 23.26, Best Reward 0.392
Time 07h 06m 41s, Avg Reward 0.376, Avg Accuracy 0.38, Avg Ep length 21.96, Best Reward 0.392
Time 07h 47m 06s, Avg Reward 0.416, Avg Accuracy 0.42, Avg Ep length 21.44, Best Reward 0.392
Time 08h 27m 26s, Avg Reward 0.38, Avg Accuracy 0.38, Avg Ep length 22.06, Best Reward 0.416
I also trained the model from scratch in easy mode, trying living costs of -0.005 and 0. Accuracy stays low there as well; the log was:
python a3c_main.py --num-processes 16 --evaluate 0 --difficulty easy
Time 00h 11m 08s, Avg Reward 0.004, Avg Accuracy 0.16, Avg Ep length 7.98, Best Reward 0.0
Time 00h 19m 36s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.56, Best Reward 0.004
Time 00h 27m 11s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 4.68, Best Reward 0.004
Time 00h 33m 53s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.0, Best Reward 0.004
Time 00h 40m 24s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.0, Best Reward 0.04
Time 00h 47m 14s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.0, Best Reward 0.04
Time 00h 53m 58s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.0, Best Reward 0.04
Time 01h 00m 38s, Avg Reward -0.08, Avg Accuracy 0.1, Avg Ep length 4.0, Best Reward 0.04
Time 01h 07m 26s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.0, Best Reward 0.04
Time 01h 14m 05s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.0, Best Reward 0.04
Time 01h 20m 46s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 4.0, Best Reward 0.04
Time 01h 28m 01s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.5, Best Reward 0.04
Time 01h 35m 43s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.94, Best Reward 0.04
Time 01h 43m 23s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.0, Best Reward 0.112
Time 01h 51m 28s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 5.0, Best Reward 0.112
Time 01h 58m 52s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.74, Best Reward 0.112
Time 02h 06m 37s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.98, Best Reward 0.112
Time 02h 14m 08s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 4.7, Best Reward 0.112
Time 02h 20m 59s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 4.04, Best Reward 0.112
Time 02h 28m 27s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 4.6, Best Reward 0.112
Time 02h 35m 23s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.26, Best Reward 0.112
Time 02h 42m 11s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.02, Best Reward 0.112
Time 02h 49m 04s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 4.16, Best Reward 0.112
Time 02h 57m 03s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 4.88, Best Reward 0.112
Time 03h 04m 38s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.7, Best Reward 0.112
Time 03h 11m 32s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.18, Best Reward 0.112
Time 03h 18m 16s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.0, Best Reward 0.112
Time 03h 25m 05s, Avg Reward 0.16, Avg Accuracy 0.3, Avg Ep length 4.0, Best Reward 0.112
Time 03h 31m 42s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 4.0, Best Reward 0.16
Time 03h 38m 36s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.0, Best Reward 0.16
Time 03h 45m 30s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 4.0, Best Reward 0.16
Time 03h 52m 15s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 4.12, Best Reward 0.16
Time 03h 59m 01s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.0, Best Reward 0.16
Time 04h 05m 44s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 4.24, Best Reward 0.16
Time 04h 13m 43s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 5.0, Best Reward 0.16
Time 04h 21m 32s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 5.0, Best Reward 0.16
Time 04h 29m 13s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 5.0, Best Reward 0.16
Time 04h 37m 07s, Avg Reward -0.08, Avg Accuracy 0.1, Avg Ep length 5.0, Best Reward 0.16
Time 04h 44m 51s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 4.86, Best Reward 0.16
Time 04h 52m 06s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 4.52, Best Reward 0.16
Time 04h 58m 46s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.06, Best Reward 0.16
Time 05h 05m 55s, Avg Reward -0.104, Avg Accuracy 0.08, Avg Ep length 4.22, Best Reward 0.16
Time 05h 16m 13s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.0, Best Reward 0.16
Time 05h 27m 28s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.0, Best Reward 0.16
Time 05h 38m 21s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.0, Best Reward 0.16
Time 05h 49m 23s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 4.0, Best Reward 0.16
Time 06h 00m 09s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.0, Best Reward 0.16
Time 06h 10m 35s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.0, Best Reward 0.16
Time 06h 21m 45s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.12, Best Reward 0.16
Time 06h 36m 23s, Avg Reward -0.08, Avg Accuracy 0.1, Avg Ep length 5.88, Best Reward 0.16
Time 06h 51m 05s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 6.0, Best Reward 0.16
Time 07h 04m 16s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 5.12, Best Reward 0.16
Time 07h 18m 24s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 6.78, Best Reward 0.16
Time 07h 28m 26s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 6.74, Best Reward 0.16
Time 07h 36m 14s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.84, Best Reward 0.16
Time 07h 43m 10s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 4.34, Best Reward 0.16
Time 07h 51m 03s, Avg Reward 0.136, Avg Accuracy 0.28, Avg Ep length 4.82, Best Reward 0.16
Time 07h 58m 41s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.64, Best Reward 0.16
Time 08h 05m 15s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 4.0, Best Reward 0.16
Time 08h 12m 02s, Avg Reward 0.184, Avg Accuracy 0.32, Avg Ep length 4.08, Best Reward 0.16
Time 08h 20m 20s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 5.36, Best Reward 0.184
Time 08h 28m 37s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 5.42, Best Reward 0.184
Time 08h 36m 42s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 5.26, Best Reward 0.184
Time 08h 45m 00s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 5.32, Best Reward 0.184
Time 08h 54m 55s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 6.58, Best Reward 0.184
Time 09h 05m 13s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 7.02, Best Reward 0.184
Time 09h 12m 35s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.66, Best Reward 0.184
Time 09h 20m 36s, Avg Reward -0.08, Avg Accuracy 0.1, Avg Ep length 5.16, Best Reward 0.184
Time 09h 28m 52s, Avg Reward 0.184, Avg Accuracy 0.32, Avg Ep length 5.42, Best Reward 0.184
Time 09h 37m 38s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 5.86, Best Reward 0.184
Time 09h 45m 54s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 5.42, Best Reward 0.184
Time 09h 52m 41s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.14, Best Reward 0.184
Time 09h 59m 17s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 4.0, Best Reward 0.184
Time 10h 05m 48s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.04, Best Reward 0.184
Time 10h 12m 19s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.0, Best Reward 0.184
Time 10h 18m 39s, Avg Reward 0.136, Avg Accuracy 0.28, Avg Ep length 4.0, Best Reward 0.184
Time 10h 25m 00s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.0, Best Reward 0.184
Time 10h 31m 55s, Avg Reward 0.184, Avg Accuracy 0.32, Avg Ep length 4.06, Best Reward 0.184
Time 10h 39m 39s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.0, Best Reward 0.184
Time 10h 47m 22s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 4.9, Best Reward 0.184
Time 10h 55m 23s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 5.0, Best Reward 0.184
Time 11h 03m 23s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 5.2, Best Reward 0.184
Time 11h 10m 32s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 4.4, Best Reward 0.184
Time 11h 17m 43s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.24, Best Reward 0.184
Time 11h 25m 51s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.26, Best Reward 0.184
Time 11h 33m 49s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 5.22, Best Reward 0.184
Time 11h 41m 35s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.0, Best Reward 0.184
Time 11h 49m 08s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 5.0, Best Reward 0.184
Time 11h 56m 36s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.78, Best Reward 0.184
Time 12h 03m 25s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.04, Best Reward 0.184
Time 12h 13m 12s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 4.0, Best Reward 0.184
Time 12h 23m 39s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 4.0, Best Reward 0.184
Time 12h 30m 18s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.0, Best Reward 0.184
Time 12h 37m 06s, Avg Reward 0.136, Avg Accuracy 0.28, Avg Ep length 4.0, Best Reward 0.184
Time 12h 43m 50s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.0, Best Reward 0.184
Time 12h 50m 31s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.0, Best Reward 0.184
Time 12h 57m 05s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 4.0, Best Reward 0.184
Time 13h 06m 19s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 4.12, Best Reward 0.184
Time 13h 18m 38s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 6.12, Best Reward 0.184
Time 13h 31m 20s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 6.2, Best Reward 0.184
Time 13h 45m 22s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 5.84, Best Reward 0.184
Time 13h 56m 15s, Avg Reward -0.104, Avg Accuracy 0.08, Avg Ep length 4.36, Best Reward 0.184
Time 14h 07m 04s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 4.22, Best Reward 0.184
Time 14h 18m 43s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.82, Best Reward 0.184
Time 14h 30m 49s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 4.94, Best Reward 0.184
Time 14h 43m 07s, Avg Reward 0.136, Avg Accuracy 0.28, Avg Ep length 4.86, Best Reward 0.184
Time 14h 55m 36s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 5.06, Best Reward 0.184
Training thread: 15 Num iters: 1K Avg policy loss: 0.117592454028 Avg value loss: 0.832445306242
Time 15h 07m 37s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 5.0, Best Reward 0.184
Training thread: 3 Num iters: 1K Avg policy loss: -0.130185781124 Avg value loss: 0.736817106047
Time 15h 20m 15s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 5.0, Best Reward 0.184
Training thread: 8 Num iters: 1K Avg policy loss: -0.115444350065 Avg value loss: 0.736987084803
Training thread: 12 Num iters: 1K Avg policy loss: -0.139137712868 Avg value loss: 0.745042469732
Training thread: 13 Num iters: 1K Avg policy loss: -0.087330975621 Avg value loss: 0.74735086819
Time 15h 32m 27s, Avg Reward 0.136, Avg Accuracy 0.28, Avg Ep length 5.0, Best Reward 0.184
Training thread: 5 Num iters: 1K Avg policy loss: -0.109482607283 Avg value loss: 0.762932332613
Training thread: 7 Num iters: 1K Avg policy loss: -0.0333308469482 Avg value loss: 0.775354296297
Training thread: 11 Num iters: 1K Avg policy loss: -0.185568212742 Avg value loss: 0.713746232403
Training thread: 10 Num iters: 1K Avg policy loss: -0.0643703758532 Avg value loss: 0.743953702528
Training thread: 6 Num iters: 1K Avg policy loss: -0.260978381567 Avg value loss: 0.682364837718
Time 15h 45m 03s, Avg Reward 0.088, Avg Accuracy 0.24, Avg Ep length 5.0, Best Reward 0.184
Training thread: 0 Num iters: 1K Avg policy loss: 0.00585684951555 Avg value loss: 0.793881461014
Training thread: 14 Num iters: 1K Avg policy loss: -0.193044243555 Avg value loss: 0.703806976835
Training thread: 1 Num iters: 1K Avg policy loss: -0.184739318009 Avg value loss: 0.707147910109
Time 15h 56m 53s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.84, Best Reward 0.184
Training thread: 9 Num iters: 1K Avg policy loss: -0.0644458006085 Avg value loss: 0.785048694798
Training thread: 4 Num iters: 1K Avg policy loss: 0.00307522571788 Avg value loss: 0.817228694934
Training thread: 2 Num iters: 1K Avg policy loss: -0.0580463716025 Avg value loss: 0.795366896593
Time 16h 09m 23s, Avg Reward -0.08, Avg Accuracy 0.1, Avg Ep length 4.94, Best Reward 0.184
Time 16h 22m 41s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 5.64, Best Reward 0.184
Time 16h 38m 12s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 6.64, Best Reward 0.184
Time 16h 54m 38s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 7.22, Best Reward 0.184
Time 17h 10m 50s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 7.0, Best Reward 0.184
Time 17h 24m 54s, Avg Reward 0.136, Avg Accuracy 0.28, Avg Ep length 5.96, Best Reward 0.184
Time 17h 38m 40s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 5.6, Best Reward 0.184
Time 17h 50m 38s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 5.0, Best Reward 0.184
Time 18h 02m 37s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 5.0, Best Reward 0.184
Time 18h 13m 26s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.4, Best Reward 0.184
Time 18h 23m 55s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 4.08, Best Reward 0.184
Time 18h 34m 18s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.0, Best Reward 0.184
Time 18h 45m 35s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.58, Best Reward 0.184
Time 18h 57m 34s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 4.86, Best Reward 0.184
Time 19h 09m 44s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 5.0, Best Reward 0.184
Time 19h 22m 10s, Avg Reward 0.112, Avg Accuracy 0.26, Avg Ep length 5.0, Best Reward 0.184
Time 19h 34m 37s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 5.0, Best Reward 0.184
Time 19h 47m 00s, Avg Reward -0.056, Avg Accuracy 0.12, Avg Ep length 5.0, Best Reward 0.184
Time 19h 59m 30s, Avg Reward 0.04, Avg Accuracy 0.2, Avg Ep length 5.0, Best Reward 0.184
Time 20h 11m 21s, Avg Reward 0.16, Avg Accuracy 0.3, Avg Ep length 5.0, Best Reward 0.184
Time 20h 23m 52s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 5.0, Best Reward 0.184
Time 20h 36m 31s, Avg Reward 0.064, Avg Accuracy 0.22, Avg Ep length 5.0, Best Reward 0.184
Time 20h 48m 41s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.0, Best Reward 0.184
Time 21h 01m 03s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 5.0, Best Reward 0.184
Time 21h 13m 27s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 5.0, Best Reward 0.184
Time 21h 25m 45s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 5.0, Best Reward 0.184
Time 21h 37m 54s, Avg Reward -0.008, Avg Accuracy 0.16, Avg Ep length 5.0, Best Reward 0.184
Time 21h 50m 15s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.04, Best Reward 0.184
Time 22h 02m 34s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.0, Best Reward 0.184
Time 22h 14m 52s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.0, Best Reward 0.184
Time 22h 27m 06s, Avg Reward 0.184, Avg Accuracy 0.32, Avg Ep length 5.0, Best Reward 0.184
Time 22h 39m 22s, Avg Reward 0.016, Avg Accuracy 0.18, Avg Ep length 5.0, Best Reward 0.184
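In case it helps with comparing runs, here is a minimal sketch of how the evaluation lines in these logs could be summarized. It is not part of the repository: the names `LOG_LINE`, `parse_eval_lines`, and `summarize` are hypothetical, and the regular expression only assumes the "Time ..., Avg Reward ..., Avg Accuracy ..." format shown above. Lines that do not match, such as the "Training thread" lines, are skipped.

```python
import re
from statistics import mean

# Matches evaluation lines in the format pasted above (assumed, not taken from the repo).
LOG_LINE = re.compile(
    r"Time (\d+)h (\d+)m (\d+)s, Avg Reward (-?[\d.]+), "
    r"Avg Accuracy ([\d.]+), Avg Ep length ([\d.]+), Best Reward (-?[\d.]+)"
)


def parse_eval_lines(text):
    """Return one dict per evaluation line; non-matching lines are ignored."""
    rows = []
    for line in text.splitlines():
        m = LOG_LINE.search(line)
        if m is None:
            continue
        hours, minutes, seconds = (int(g) for g in m.group(1, 2, 3))
        rows.append({
            "seconds": hours * 3600 + minutes * 60 + seconds,
            "reward": float(m.group(4)),
            "accuracy": float(m.group(5)),
            "ep_length": float(m.group(6)),
        })
    return rows


def summarize(rows):
    """Aggregate accuracy and episode length over all parsed evaluation lines."""
    return {
        "n": len(rows),
        "mean_accuracy": mean(r["accuracy"] for r in rows),
        "max_accuracy": max(r["accuracy"] for r in rows),
        "mean_ep_length": mean(r["ep_length"] for r in rows),
    }


# Three lines copied from the from-scratch log above; the last one is not an evaluation line.
sample = """\
Time 00h 11m 08s, Avg Reward 0.004, Avg Accuracy 0.16, Avg Ep length 7.98, Best Reward 0.0
Time 00h 19m 36s, Avg Reward -0.032, Avg Accuracy 0.14, Avg Ep length 5.56, Best Reward 0.004
Training thread: 15 Num iters: 1K Avg policy loss: 0.117592454028 Avg value loss: 0.832445306242
"""

rows = parse_eval_lines(sample)
summary = summarize(rows)
print(summary)
```

Running this over the full output of each run (for example, after redirecting stdout to a file) would give one mean and maximum accuracy per run, which makes the pre-trained and from-scratch runs easier to compare.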
System config:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
It would be really helpful if you could point out what might be going wrong in my training procedure.
Regards