When training with a preemptible TPU, the training process (t5.models.mesh_transformer_main) sometimes does not exit with an error code after the TPU is preempted; instead it hangs indefinitely.
Here is an example of the log:
I0902 03:33:36.516501 140070334211904 basic_session_run_hooks.py:260] loss = 1.109375, step = 488600 (45.410 sec)
INFO:tensorflow:global_step/sec: 2.20221
I0902 03:33:36.518121 140070334211904 tpu_estimator.py:2402] global_step/sec: 2.20221
INFO:tensorflow:examples/sec: 140.942
I0902 03:33:36.518576 140070334211904 tpu_estimator.py:2403] examples/sec: 140.942
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
I0902 03:33:36.520152 140070334211904 tpu_estimator.py:616] Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
I0902 03:33:36.520488 140070334211904 tpu_estimator.py:620] Dequeue next (100) batch(es) of data from outfeed.
INFO:tensorflow:Outfeed finished for iteration (1862, 53)
I0902 03:34:01.018308 140066416998144 tpu_estimator.py:289] Outfeed finished for iteration (1862, 53)
INFO:tensorflow:ShutdownHook: lame workers found: HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0)
I0902 03:34:21.925864 140070334211904 session_support.py:391] ShutdownHook: lame workers found: HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0)
INFO:tensorflow:ShutdownHook: saving checkpoint to gs://somewhere/model.ckpt
I0902 03:34:21.941661 140070334211904 session_support.py:394] ShutdownHook: saving checkpoint to gs://somewhere/model.ckpt
INFO:tensorflow:No save on shutdown when there are user-defined CheckpointSaverHooks
I0902 03:34:21.942317 140070334211904 tpu_estimator.py:2370] No save on shutdown when there are user-defined CheckpointSaverHooks
INFO:tensorflow:Shutting down HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0).
I0902 03:34:21.942646 140070334211904 session_support.py:150] Shutting down HeartbeatManager(/job:worker/replica:0/task:0/device:CPU:0).
INFO:tensorflow:Configuring worker heartbeat: shutdown_mode: SHUTDOWN_AFTER_TIMEOUT
watchdog_config {
timeout_ms: 60000
}
exit_code {
exit_code: 42
}
I0902 03:34:21.943512 140070334211904 session_support.py:104] Configuring worker heartbeat: shutdown_mode: SHUTDOWN_AFTER_TIMEOUT
watchdog_config {
timeout_ms: 60000
}
exit_code {
exit_code: 42
}
INFO:tensorflow:Waiting 70.00 seconds for worker shutdown.
I0902 03:34:21.945668 140070334211904 session_support.py:159] Waiting 70.00 seconds for worker shutdown.
INFO:tensorflow:Resetting coordinator.
I0902 03:35:32.017142 140070334211904 session_support.py:423] Resetting coordinator.
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Resetting session loop due to worker shutdown.
I0902 03:35:32.020745 140070334211904 monitored_session.py:1286] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Resetting session loop due to worker shutdown.
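
For reference, since the process neither exits with exit_code 42 nor resumes after the worker shutdown, one possible workaround is to supervise it externally and restart it from the latest checkpoint when its log stops advancing. Below is a minimal sketch of such a watchdog; the training command, the 15-minute stall threshold, and the restart policy are assumptions for illustration, not anything provided by t5.

```python
#!/usr/bin/env python3
"""Illustrative watchdog sketch (not part of t5): restart the trainer when its
output stalls, which is the symptom shown in the log above. The training
command and the stall threshold are assumptions; adjust them for your setup."""
import subprocess
import sys
import threading
import time

# Hypothetical command; substitute your real mesh_transformer_main invocation
# with its usual --tpu/--model_dir/--gin_file flags.
TRAIN_CMD = ["python", "-m", "t5.models.mesh_transformer_main"]
STALL_SECONDS = 15 * 60  # treat the process as hung after 15 min of silence


def pump(stream, state):
    """Copy trainer output to our stdout and record when we last saw a line."""
    for line in stream:
        state["last_output"] = time.time()
        sys.stdout.write(line)
    stream.close()


def run_once():
    """Run the trainer once; kill it if it goes silent for STALL_SECONDS."""
    proc = subprocess.Popen(
        TRAIN_CMD, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    state = {"last_output": time.time()}
    reader = threading.Thread(target=pump, args=(proc.stdout, state), daemon=True)
    reader.start()
    while proc.poll() is None:
        if time.time() - state["last_output"] > STALL_SECONDS:
            proc.kill()  # hung: no log output for too long
            proc.wait()
            return None  # signal "hung" to the caller
        time.sleep(10)
    return proc.returncode


if __name__ == "__main__":
    while True:
        rc = run_once()
        if rc == 0:
            break  # training finished normally
        # rc is None (hang) or non-zero (e.g. the exit_code 42 configured for
        # worker shutdown): wait for the preempted TPU to come back, then the
        # trainer resumes from the latest checkpoint in the model directory.
        time.sleep(60)
```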