I am getting an "Invalid JPEG data or crop window" error, but I have double-checked that the images in my TFRecords are JPEGs. What possible reasons could cause this error?
The code I use to check the image format in the TFRecords:
from io import BytesIO
import tensorflow as tf
from PIL import Image
from tqdm import tqdm

for tfrecord in tqdm(tfrecord_files):
    for example in tqdm(tf.python_io.tf_record_iterator(tfrecord)):
        data = tf.train.Example.FromString(example)
        encoded_jpg = data.features.feature['image/encoded'].bytes_list.value[0]
        img = Image.open(BytesIO(encoded_jpg))  # PIL inspects the image header
        assert img.format == 'JPEG'
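For completeness, a stricter check would be to feed every encoded image through TensorFlow's own decoder, since that is the op behind the failing DecodeJpeg node and PIL only inspects the header (a truncated or corrupted JPEG body could still pass the check above). A minimal sketch, assuming TF 1.x and the same tfrecord_files list as above:

import tensorflow as tf
from tqdm import tqdm

# Decode every record with tf.image.decode_jpeg, the same decoder the
# input pipeline uses, and report records that it rejects.
encoded_ph = tf.placeholder(tf.string)
decoded = tf.image.decode_jpeg(encoded_ph, channels=3)

with tf.Session() as sess:
    for tfrecord in tqdm(tfrecord_files):
        for example in tf.python_io.tf_record_iterator(tfrecord):
            data = tf.train.Example.FromString(example)
            encoded_jpg = data.features.feature['image/encoded'].bytes_list.value[0]
            try:
                sess.run(decoded, feed_dict={encoded_ph: encoded_jpg})
            except tf.errors.InvalidArgumentError as e:
                print('Bad record in %s: %s' % (tfrecord, e))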
The log from when I hit the error:
E0719 23:46:18.549607 139925925385984 error_handling.py:70] Error recorded frominfeed: From /job:worker/replica:0/task:0:
Invalid JPEG data or crop window, data size 36864
[[{{node parser/case/cond/else/_20/cond_jpeg/then/_0/DecodeJpeg}}]]
[[input_pipeline_task0/while/IteratorGetNext_1]]
E0719 23:46:18.572818 139925916993280 error_handling.py:70] Error recorded fromoutfeed: From /job:worker/replica:0/task:0:
Bad hardware status: 0x1
[[node OutfeedDequeueTuple_4 (defined at /home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py:115) ]]
Original stack trace for u'OutfeedDequeueTuple_4':
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 156, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python2.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 151, in main
    run_executer(params, train_input_fn, eval_input_fn)
  File "tpu/models/official/mask_rcnn/mask_rcnn_main.py", line 99, in run_executer
    executer.train(train_input_fn, FLAGS.eval_after_training, eval_input_fn)
  File "/home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py", line 115, in train
    input_fn=train_input_fn, max_steps=self._model_params.total_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2721, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 362, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1184, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2560, in _call_model_fn
    config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1142, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2870, in _model_fn
    host_ops = host_call.create_tpu_hostcall()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1943, in create_tpu_hostcall
    device_ordinal=ordinal_id)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_tpu_ops.py", line 3190, in outfeed_dequeue_tuple
    device_ordinal=device_ordinal, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()
E0719 23:46:19.930372 139927321310656 error_handling.py:70] Error recorded fromtraining_loop: From /job:worker/replica:0/task:0:
9 root error(s) found.
(0) Cancelled: Node was closed
(1) Cancelled: Node was closed
(2) Cancelled: Node was closed
(3) Cancelled: Node was closed
(4) Cancelled: Node was closed
(5) Cancelled: Node was closed
(6) Cancelled: Node was closed
(7) Cancelled: Node was closed
(8) Invalid argument: Gradient for resnet50/batch_normalization_32/beta:0 is NaN : Tensor had NaN values
[[node CheckNumerics_98 (defined at /home/panfeng/projects/tpu/models/official/mask_rcnn/distributed_executer.py:115) ]]
0 successful operations.
0 derived errors ignored.